# Python and pandas &ndash; first steps

This Jupyter notebook collects basic features of Python and the pandas library. You don't need prior coding experience to get started.

A library is a collection of ready-made program code that can be used in your own programs. The basic library of data analytics is called __pandas__. The pandas library is installed along with Anaconda. It includes, among other things, the **crosstab** function for calculating count and percentage summaries and the **describe** function for quickly calculating statistical key figures.


If the library is installed, you can use the **import** command to make it available. For example, **import pandas as pd** imports the pandas library so that in the following code cells it can be referred to by the abbreviation **pd**.

You can run any cell of your notebook with the key combinations **ctrl-enter** (stay in the same cell) or **shift-enter** (move to the next cell).

In [1]:
import pandas as pd

## Reading data

You can use the **read_excel** function in the pandas library to read data from an Excel file. If the file is in the same folder as the Jupyter notebook, then it can be referred to by the filename alone, e.g. '**data1_en.xlsx**'. In the following cell, however, the file is read from a web address, and the data is stored as the value of the variable called **df**.

In [2]:
df = pd.read_excel('https://myy.haaga-helia.fi/~menetelmat/Data-analytiikka/Teaching/data1_en.xlsx')

The pandas library defines a data structure called **dataframe**, which can be thought of as the equivalent of an Excel table. The data object **df** opened above is a dataframe.
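
A dataframe can also be built by hand, which is handy for trying things out without a data file. The following is only a small sketch: the name `small` is made up for this example, and the values are copied from the first three rows of the data shown in this notebook.

```python
import pandas as pd

# A tiny hand-made dataframe: each key is a column name,
# each list holds the values of that column (hypothetical sample).
small = pd.DataFrame({
    'age': [38, 29, 30],
    'salary': [3587, 2963, 1989],
})

# Like an Excel table, it has rows and named columns.
print(small)
print(small.shape)
```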

## Getting to know the data

Writing the name of the dataframe on the last line of a cell displays the first and last rows of the dataframe.

In [3]:
df

Unnamed: 0,number,sex,age,family,education,empl_years,salary,management,colleagues,environment,salary_level,duties,occu_health,timeshare,gym,massage
0,1,1,38,1,1.0,22.0,3587,3,3.0,3,3,3,,,,
1,2,1,29,2,2.0,10.0,2963,1,5.0,2,1,3,,,,
2,3,1,30,1,1.0,7.0,1989,3,4.0,1,1,3,1.0,,,
3,4,1,36,2,1.0,14.0,2144,3,3.0,3,3,3,1.0,,,
4,5,1,24,1,2.0,4.0,2183,2,3.0,2,1,2,1.0,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
77,78,1,22,1,3.0,0.0,1598,4,4.0,4,3,4,,1.0,1.0,
78,79,1,33,1,1.0,2.0,1638,1,3.0,2,1,2,1.0,,,
79,80,1,27,1,2.0,7.0,2612,3,4.0,3,3,3,1.0,,1.0,
80,81,1,35,2,2.0,16.0,2808,3,4.0,3,3,3,,,,


The DataFrame object has numerous properties and functions, examples of which we will see later. Note that
- there are no parentheses after the name of a property
- there are always parentheses after the name of a function.

In [4]:
### The shape property contains the number of rows and columns

df.shape

(82, 16)

In [5]:
### The info function prints information about the columns

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82 entries, 0 to 81
Data columns (total 16 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   number        82 non-null     int64  
 1   sex           82 non-null     int64  
 2   age           82 non-null     int64  
 3   family        82 non-null     int64  
 4   education     81 non-null     float64
 5   empl_years    80 non-null     float64
 6   salary        82 non-null     int64  
 7   management    82 non-null     int64  
 8   colleagues    81 non-null     float64
 9   environment   82 non-null     int64  
 10  salary_level  82 non-null     int64  
 11  duties        82 non-null     int64  
 12  occu_health   47 non-null     float64
 13  timeshare     20 non-null     float64
 14  gym           9 non-null      float64
 15  massage       22 non-null     float64
dtypes: float64(7), int64(9)
memory usage: 10.4 KB


The output above shows that this dataframe contains columns of type __int64__ (integers) and __float64__ (floating-point numbers). You can think of floating-point numbers as decimal numbers.

If the column had textual information, then it would appear as an __object__ type.
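
As a rough illustration of the data types, the sketch below builds a small made-up dataframe (the name `sample` and its values are assumptions for this example only). It also shows one likely reason why a column such as education appears as float64 in the listing above: a whole-number column that contains a missing value is stored as floating-point numbers.

```python
import pandas as pd

# Hypothetical sample: one complete integer column, one integer
# column with a missing value, and one text column.
sample = pd.DataFrame({
    'age': [38, 29, 30],
    'education': [1, float('nan'), 2],
    'department': ['sales', 'it', 'sales'],
})

# age is int64; education becomes float64 because of the missing value.
# The text column is typically shown as object (newer pandas versions
# may show a dedicated string type instead).
print(sample.dtypes)
```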

Rows of the data are easy to filter by different conditions. A column in a dataframe can be referenced by entering the column name inside square brackets, for example _df['salary']_.

The following operators can be used in filtering:
- comparison operators **>**, **<**, **>=**, **<=**, **==**, **!=**
- logical operators **&** (and), **|** (or)
- negation operator **~**
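
The operators listed above can be tried out on a small made-up dataframe. This is a minimal sketch: the name `staff` and its values are hypothetical. Each comparison produces a column of True/False values, and only the rows marked True are kept. Each condition needs its own parentheses when conditions are combined.

```python
import pandas as pd

# Hypothetical sample; a missing value marks "has not used the service".
staff = pd.DataFrame({
    'age': [25, 52, 41, 58],
    'salary': [1800, 1900, 3200, 2600],
    'occu_health': [1.0, float('nan'), 1.0, float('nan')],
})

# and: both conditions must hold
older_low_paid = staff[(staff['age'] > 50) & (staff['salary'] < 2000)]

# or: at least one condition must hold
young_or_high = staff[(staff['age'] < 30) | (staff['salary'] > 3000)]

# negation: a missing value is never equal to 1, so negating the
# comparison also picks up the rows with a missing value
no_occu_health = staff[~(staff['occu_health'] == 1)]

print(older_low_paid)
print(young_or_high)
print(no_occu_health)
```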

In [6]:
### 30-year-olds

df[df['age']==30]

Unnamed: 0,number,sex,age,family,education,empl_years,salary,management,colleagues,environment,salary_level,duties,occu_health,timeshare,gym,massage
2,3,1,30,1,1.0,7.0,1989,3,4.0,1,1,3,1.0,,,
38,39,1,30,1,2.0,10.0,2300,3,5.0,3,3,4,,,,
43,44,1,30,1,2.0,7.0,2223,2,3.0,4,1,3,1.0,,,1.0


In [7]:
### Those whose education is not 1 and whose salary is no more than 1700 (note the parentheses)

df[(df['education'] != 1) & (df['salary'] <= 1700) ]

Unnamed: 0,number,sex,age,family,education,empl_years,salary,management,colleagues,environment,salary_level,duties,occu_health,timeshare,gym,massage
25,26,1,26,1,2.0,3.0,1521,2,4.0,2,1,3,1.0,,1.0,1.0
35,36,1,31,2,3.0,0.0,1559,2,4.0,3,1,3,1.0,,,
53,54,1,25,1,2.0,1.0,1559,2,4.0,3,1,2,1.0,,,
75,76,1,37,1,2.0,15.0,1598,1,5.0,1,1,1,1.0,,,
77,78,1,22,1,3.0,0.0,1598,4,4.0,4,3,4,,1.0,1.0,


In [8]:
### Over 50 years of age with less than 2000 salary (note the parentheses)

df[(df['age'] > 50) & (df['salary'] < 2000)]

Unnamed: 0,number,sex,age,family,education,empl_years,salary,management,colleagues,environment,salary_level,duties,occu_health,timeshare,gym,massage
45,46,2,51,2,1.0,28.0,1989,3,3.0,2,2,3,1.0,,,1.0
62,63,2,51,2,2.0,10.0,1872,4,3.0,2,2,3,1.0,,,


In [9]:
### Those who have not used occupational health care and whose salary is below 2000

df[ ~(df['occu_health'] == 1) & (df['salary'] < 2000)]

Unnamed: 0,number,sex,age,family,education,empl_years,salary,management,colleagues,environment,salary_level,duties,occu_health,timeshare,gym,massage
46,47,2,22,1,3.0,21.0,1872,3,3.0,4,1,3,,,1.0,
77,78,1,22,1,3.0,0.0,1598,4,4.0,4,3,4,,1.0,1.0,


Additional information can be given to functions as parameters. For example, for the __nsmallest__ function, possible parameters can be found on the function's help page https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nsmallest.html.

Terminology: __n__ and __columns__ are parameters of the nsmallest function that can be assigned a value. The value given to the parameter is called an _argument_. In the following __3__ is the value of parameter __n__, that is, the argument.

In [10]:
### Rows with the three lowest salaries

df.nsmallest(3, 'salary')

Unnamed: 0,number,sex,age,family,education,empl_years,salary,management,colleagues,environment,salary_level,duties,occu_health,timeshare,gym,massage
25,26,1,26,1,2.0,3.0,1521,2,4.0,2,1,3,1.0,,1.0,1.0
35,36,1,31,2,3.0,0.0,1559,2,4.0,3,1,3,1.0,,,
53,54,1,25,1,2.0,1.0,1559,2,4.0,3,1,2,1.0,,,
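
The same kind of call can also be written with the parameter names spelled out, which makes it clearer which argument belongs to which parameter. The sketch below uses a small hypothetical dataframe named `pay`. The parameter names __n__ and __columns__ are the ones mentioned in the terminology note above.

```python
import pandas as pd

# Hypothetical sample data.
pay = pd.DataFrame({
    'number': [1, 2, 3, 4, 5],
    'salary': [3587, 2963, 1989, 2144, 2183],
})

# Arguments given by position ...
positional = pay.nsmallest(3, 'salary')

# ... and the same arguments given by parameter name.
keyword = pay.nsmallest(n=3, columns='salary')

print(keyword)
```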


In [11]:
### Rows with the five highest ages

df.nlargest(5, 'age')

Unnamed: 0,number,sex,age,family,education,empl_years,salary,management,colleagues,environment,salary_level,duties,occu_health,timeshare,gym,massage
56,57,1,61,2,2.0,36.0,3119,2,,2,1,5,1.0,,,1.0
32,33,1,59,2,3.0,15.0,6278,4,4.0,5,4,4,,1.0,,
13,14,1,58,2,3.0,21.0,3587,4,5.0,4,1,3,,,,
27,28,2,56,1,1.0,15.0,2223,3,4.0,3,2,4,1.0,,,1.0
36,37,2,56,2,2.0,17.0,2729,5,5.0,5,5,5,,,,1.0


In [12]:
### the unique function prints the unique values of the column (the order comes from the data)

df['age'].unique()

array([38, 29, 30, 36, 24, 31, 49, 55, 40, 33, 39, 35, 58, 53, 42, 26, 47,
       44, 43, 56, 21, 45, 59, 37, 28, 50, 32, 51, 22, 34, 27, 41, 25, 61,
       20, 52, 46], dtype=int64)

## Statistical key figures

The __describe__ function calculates statistical key figures:
- number of values __count__
- __mean__
- standard deviation __std__
- smallest __min__
- lower quartile __25%__; a quarter of the values are at most the lower quartile
- median __50%__
- upper quartile __75%__; a quarter of the values are at least the upper quartile
- largest __max__

In [13]:
df.describe()

Unnamed: 0,number,sex,age,family,education,empl_years,salary,management,colleagues,environment,salary_level,duties,occu_health,timeshare,gym,massage
count,82.0,82.0,82.0,82.0,81.0,80.0,82.0,82.0,81.0,82.0,82.0,82.0,47.0,20.0,9.0,22.0
mean,41.5,1.231707,37.95122,1.621951,1.987654,12.175,2563.878049,3.060976,4.061728,3.219512,2.109756,3.195122,1.0,1.0,1.0,1.0
std,23.815261,0.424519,9.773866,0.487884,0.844006,8.807038,849.350302,1.058155,0.826826,1.154961,1.111179,1.047502,0.0,0.0,0.0,0.0
min,1.0,1.0,20.0,1.0,1.0,0.0,1521.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,21.25,1.0,31.0,1.0,1.0,3.75,2027.0,2.0,4.0,3.0,1.0,3.0,1.0,1.0,1.0,1.0
50%,41.5,1.0,37.5,2.0,2.0,12.5,2320.0,3.0,4.0,3.0,2.0,3.0,1.0,1.0,1.0,1.0
75%,61.75,1.0,44.0,2.0,3.0,18.25,2808.0,4.0,5.0,4.0,3.0,4.0,1.0,1.0,1.0,1.0
max,82.0,2.0,61.0,2.0,4.0,36.0,6278.0,5.0,5.0,5.0,5.0,5.0,1.0,1.0,1.0,1.0


In [14]:
### statistical key figures of salary

df['salary'].describe()

count      82.000000
mean     2563.878049
std       849.350302
min      1521.000000
25%      2027.000000
50%      2320.000000
75%      2808.000000
max      6278.000000
Name: salary, dtype: float64
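
To see how the quartiles are read, the following sketch calculates them for eight made-up values (the name `values` is an assumption for this example). With this particular data, exactly a quarter of the values fall at or below the lower quartile and a quarter at or above the upper quartile. With other data the shares may be only approximately a quarter.

```python
import pandas as pd

# Eight hypothetical values.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])

summary = values.describe()
lower = summary['25%']
upper = summary['75%']

# Share of values at most the lower quartile / at least the upper quartile.
share_low = (values <= lower).mean()
share_high = (values >= upper).mean()

print(lower, upper, share_low, share_high)
```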

## Frequency table

In the following, the result given by the __crosstab__ function is placed as the value of the variable called __df1__. In this case, the result is not displayed unless the name of the dataframe is given separately.

More information about the parameters of the crosstab function can be found at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html.

In [15]:
df1 = pd.crosstab(df['education'], 'f')
df1

col_0,f
education,Unnamed: 1_level_1
1.0,27
2.0,30
3.0,22
4.0,2


Crosstab calculates frequencies, that is, how many times each value appears. For example, 27 people have education level 1.

The table produced by the crosstab function is a dataframe as is the original data. The dataframe always consists of three parts:
- __index__, shown in bold on the left side
- column headers __columns__, shown in bold at the top
- actual data; one column can contain only one type of data (for example, integers).

Above, the values 1, 2, 3 and 4 that appeared in the education column are the index. The only column header is f. The index has a name (education) and the column headers have a name (col_0).

In [16]:
### Let's change the name of the column headers to an empty string.

df1.columns.name = ''
df1

Unnamed: 0_level_0,f
education,Unnamed: 1_level_1
1.0,27
2.0,30
3.0,22
4.0,2


__List__ is a Python data structure that is written inside square brackets. In the following, a list is defined and placed in the index of the dataframe (__index__).

In [17]:
education = ['Comprehensive school', 'Secondary school', 'University level degree', 'Master level degree']

df1.index = education

df1

Unnamed: 0,f
Comprehensive school,27
Secondary school,30
University level degree,22
Master level degree,2


If a column is referenced that does not yet exist, one is created. In the following example, a column is created in which percentages are calculated.

In [18]:
### Number of observations

n = df1['f'].sum()

### Add a column % for percentages

df1['%'] = df1['f']/n*100

df1

Unnamed: 0,f,%
Comprehensive school,27,33.333333
Secondary school,30,37.037037
University level degree,22,27.160494
Master level degree,2,2.469136


To create a new row, refer to the new row with the loc property. In the following, the Total row is calculated with the sum function.

In [19]:
df1.loc['Total'] = df1.sum()
df1

Unnamed: 0,f,%
Comprehensive school,27.0,33.333333
Secondary school,30.0,37.037037
University level degree,22.0,27.160494
Master level degree,2.0,2.469136
Total,81.0,100.0


__Warning!__ Try running the cell above again and see what happens to the Total row: the existing Total row is included in the new sum, so the totals double. After that, you should run all cells of the notebook again using the __Kernel__ menu command __Restart & Run All__.

You can avoid this problem by using _df1.loc['Comprehensive school':'Master level degree'].sum()_ instead of _df1.sum()_, because the sum is then calculated over the original rows only.
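
The safer variant can be tried on a small hand-built table. This is only a sketch: the name `freq` is made up, and the frequencies are typed in by hand from the table above. The loop simulates running the same cell twice.

```python
import pandas as pd

# Hand-built frequency table (values typed in from the table above).
freq = pd.DataFrame(
    {'f': [27, 30, 22, 2]},
    index=['Comprehensive school', 'Secondary school',
           'University level degree', 'Master level degree'],
)

# Simulate running the cell twice: the slice covers only the original
# rows, so an existing Total row is never added into the new sum.
for _ in range(2):
    freq.loc['Total'] = freq.loc['Comprehensive school':'Master level degree'].sum()

total = freq.loc['Total', 'f']
print(freq)
```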

The crosstab function can also be used to calculate a cross-tabulation. In the following, the cross-tabulation between education and gender is calculated.

In [20]:
### Calculate the cross-tabulation of education and gender

df2 = pd.crosstab(df['education'], df['sex'])

### View the result

df2

sex,1,2
education,Unnamed: 1_level_1,Unnamed: 2_level_1
1.0,22,5
2.0,23,7
3.0,15,7
4.0,2,0
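
Percentages can also be calculated directly in a cross-tabulation. The sketch below assumes the __normalize__ parameter of crosstab, described on the help page linked earlier, and uses a small hypothetical dataframe named `people`. With _normalize='index'_ each row is scaled so that it sums to one; multiplying by 100 gives row percentages.

```python
import pandas as pd

# Hypothetical sample data.
people = pd.DataFrame({
    'education': [1, 1, 1, 2, 2, 3],
    'sex': [1, 2, 1, 1, 2, 2],
})

# Plain counts
counts = pd.crosstab(people['education'], people['sex'])

# Row percentages: each education level sums to 100
row_pct = pd.crosstab(people['education'], people['sex'], normalize='index') * 100

print(counts)
print(row_pct)
```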


The textual values of the education column (previously defined list) can be placed in the index. We also create a list for genders and place its values in the column headers.

In [21]:
gender = ['Man', 'Woman']

df2.index = education
df2.columns = gender
df2

Unnamed: 0,Man,Woman
Comprehensive school,22,5
Secondary school,23,7
University level degree,15,7
Master level degree,2,0


The index in the dataframe may be named. Similarly, column headers may be named.

In [22]:
df2.index.name = 'Education'
df2.columns.name = 'Gender'
df2

Gender,Man,Woman
Education,Unnamed: 1_level_1,Unnamed: 2_level_1
Comprehensive school,22,5
Secondary school,23,7
University level degree,15,7
Master level degree,2,0


## Managing dataframes

Along with the list, the __dictionary__ is another key data structure in Python. The dictionary is written inside curly brackets and consists of pairs. The members of the pair are separated by a colon.

The following uses a dictionary to rename two variables.

In [23]:
### Renaming variable using the dictionary data structure

df = df.rename(columns={'sex':'gender', 'empl_years':'years_of_service'})

### View the column names
df.columns

Index(['number', 'gender', 'age', 'family', 'education', 'years_of_service',
       'salary', 'management', 'colleagues', 'environment', 'salary_level',
       'duties', 'occu_health', 'timeshare', 'gym', 'massage'],
      dtype='object')

__Warning!__ If, after renaming the variables, you return to a previous cell that refers to the sex or empl_years variable, then running the cell will result in an error message. In such a situation, you should re-run all cells using the __Kernel__ menu command __Restart & Run All__.
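
The sketch below looks at the dictionary on its own, using a made-up two-column dataframe named `frame`. A value in a dictionary is looked up by its key, and rename returns a new dataframe without changing the original, which is why the result is assigned to a variable.

```python
import pandas as pd

# A dictionary consists of key: value pairs.
new_names = {'sex': 'gender', 'empl_years': 'years_of_service'}

# Look up a value by its key.
print(new_names['sex'])

# Hypothetical sample data with the old column names.
frame = pd.DataFrame({'sex': [1, 2], 'empl_years': [22.0, 10.0]})

# rename returns a new dataframe; frame itself keeps the old names.
renamed = frame.rename(columns=new_names)

print(list(frame.columns))
print(list(renamed.columns))
```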

Change the order of the index using the reindex function. Note that the new order will not be saved in the dataframe unless it is specifically saved as the value of df2!

In [24]:
### Change the order of the index

df2.reindex(['Comprehensive school', 'Master level degree', 'University level degree', 'Secondary school'])

Gender,Man,Woman
Education,Unnamed: 1_level_1,Unnamed: 2_level_1
Comprehensive school,22,5
Master level degree,2,0
University level degree,15,7
Secondary school,23,7


To change the order of the columns, the axis parameter is given a value of one.

In [25]:
### Change the order of the columns

df2.reindex(['Woman', 'Man'], axis=1)

Gender,Woman,Man
Education,Unnamed: 1_level_1,Unnamed: 2_level_1
Comprehensive school,5,22
Secondary school,7,23
University level degree,7,15
Master level degree,0,2


Move the index into a regular column and replace it with the default index 0, 1, ...

In [26]:
### Replace the index with the default index.

df2 = df2.reset_index()
df2

Gender,Education,Man,Woman
0,Comprehensive school,22,5
1,Secondary school,23,7
2,University level degree,15,7
3,Master level degree,2,0


The old index can always be restored.

In [27]:
### Restoring Education as an index.

df2 = df2.set_index('Education')
df2

Gender,Man,Woman
Education,Unnamed: 1_level_1,Unnamed: 2_level_1
Comprehensive school,22,5
Secondary school,23,7
University level degree,15,7
Master level degree,2,0


The **loc** property can be used to refer to “slices” of data using index values and column headers.

In [28]:
### How many men are there with secondary education?

df2.loc['Secondary school', 'Man']

23

A mere colon denotes all rows or all columns.

In [30]:
### Data from the 'Woman' column in df2
df2.loc[:, 'Woman']

Education
Comprehensive school       5
Secondary school           7
University level degree    7
Master level degree        0
Name: Woman, dtype: int64

Above, there is data in only one column, and therefore the result is not of type dataframe, but __series__. However, the result can be converted into a dataframe.

In [31]:
df2.loc[:, 'Woman'].to_frame()

Unnamed: 0_level_0,Woman
Education,Unnamed: 1_level_1
Comprehensive school,5
Secondary school,7
University level degree,7
Master level degree,0
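
Another way to get a one-column dataframe, shown here as a sketch on a small hand-built table (the name `table` is made up), is to give the column name inside a list. A single name gives a series, and a list of names gives a dataframe.

```python
import pandas as pd

# Hand-built table (values typed in from the cross-tabulation above).
table = pd.DataFrame(
    {'Man': [22, 23], 'Woman': [5, 7]},
    index=['Comprehensive school', 'Secondary school'],
)

# A single column name gives a series ...
as_series = table.loc[:, 'Woman']

# ... a list of column names gives a dataframe, even with one name.
as_frame = table.loc[:, ['Woman']]

print(type(as_series))
print(type(as_frame))
```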


In [32]:
### All columns are printed, Master level degree is omitted from Education.

df2.loc['Comprehensive school':'University level degree', :]

Gender,Man,Woman
Education,Unnamed: 1_level_1,Unnamed: 2_level_1
Comprehensive school,22,5
Secondary school,23,7
University level degree,15,7


## Classified distribution

In order to determine the class limits, the minimum and maximum ages are checked.

In [34]:
### Statistical key figures of the column variable age:

df['age'].describe()

count    82.000000
mean     37.951220
std       9.773866
min      20.000000
25%      31.000000
50%      37.500000
75%      44.000000
max      61.000000
Name: age, dtype: float64

Classes can be formed using the __cut__ function of the pandas library. First specify class boundaries as a list.

In [35]:
age_groups = [19, 29, 39, 49, 59, 69]

df['age_group'] = pd.cut(df['age'], age_groups)

### Age groups can be now found in the last column.

df

Unnamed: 0,number,gender,age,family,education,years_of_service,salary,management,colleagues,environment,salary_level,duties,occu_health,timeshare,gym,massage,age_group
0,1,1,38,1,1.0,22.0,3587,3,3.0,3,3,3,,,,,"(29, 39]"
1,2,1,29,2,2.0,10.0,2963,1,5.0,2,1,3,,,,,"(19, 29]"
2,3,1,30,1,1.0,7.0,1989,3,4.0,1,1,3,1.0,,,,"(29, 39]"
3,4,1,36,2,1.0,14.0,2144,3,3.0,3,3,3,1.0,,,,"(29, 39]"
4,5,1,24,1,2.0,4.0,2183,2,3.0,2,1,2,1.0,,,,"(19, 29]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
77,78,1,22,1,3.0,0.0,1598,4,4.0,4,3,4,,1.0,1.0,,"(19, 29]"
78,79,1,33,1,1.0,2.0,1638,1,3.0,2,1,2,1.0,,,,"(29, 39]"
79,80,1,27,1,2.0,7.0,2612,3,4.0,3,3,3,1.0,,1.0,,"(19, 29]"
80,81,1,35,2,2.0,16.0,2808,3,4.0,3,3,3,,,,,"(29, 39]"


In [36]:
### Classified distribution

df3 = pd.crosstab(df['age_group'], 'f')
df3.columns.name = ''
df3

Unnamed: 0_level_0,f
age_group,Unnamed: 1_level_1
"(19, 29]",17
"(29, 39]",30
"(39, 49]",23
"(49, 59]",11
"(59, 69]",1


Note here the (standard) notation: (a,b] means that a is not included in the group, but b is included.
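
The treatment of the class boundaries can be checked with a few made-up ages. The sketch below also assumes the __labels__ parameter of cut, which replaces the interval notation with names of your own. The names `ages`, `groups` and `labelled` are hypothetical.

```python
import pandas as pd

# Hypothetical ages chosen to sit on the class boundaries.
ages = pd.Series([20, 29, 30, 39])

# Default behaviour: the upper limit belongs to the class, the lower does not.
groups = pd.cut(ages, [19, 29, 39])

# The same classes with readable labels (one label per class).
labelled = pd.cut(ages, [19, 29, 39], labels=['20-29', '30-39'])

print(groups)
print(labelled)
```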

## Group by

Statistical key figures can also be calculated separately for the groups determined by a variable. Data is divided into groups using the groupby function.

In the following, statistical key figures for the salary are calculated in the groups determined by the education.

In [37]:
df4 = df.groupby('education')['salary'].describe()

### The previously defined education list contains the textual values of the education levels.

df4.index = education

df4

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Comprehensive school,27.0,2310.481481,473.108653,1638.0,2008.0,2144.0,2534.0,3587.0
Secondary school,30.0,2403.1,533.984395,1521.0,2008.25,2378.5,2729.0,3510.0
University level degree,22.0,2887.227273,1108.404384,1559.0,2222.25,2710.0,2925.0,6278.0
Master level degree,2.0,5147.0,110.308658,5069.0,5108.0,5147.0,5186.0,5225.0
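
If only some key figures are needed, groupby can be combined with a single function or with a list of function names. The following is a minimal sketch on made-up data: the name `staff` and its values are hypothetical, and the use of __agg__ is an assumption beyond what this notebook covers.

```python
import pandas as pd

# Hypothetical sample data.
staff = pd.DataFrame({
    'education': [1, 1, 2, 2, 2],
    'salary': [2000, 3000, 2400, 2600, 3100],
})

# One key figure per group
mean_salary = staff.groupby('education')['salary'].mean()

# Several chosen key figures per group
summary = staff.groupby('education')['salary'].agg(['count', 'mean', 'max'])

print(mean_salary)
print(summary)
```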


The dataframe can also be transposed.

In [38]:
### In the following df4 is transposed.

df4.T

Unnamed: 0,Comprehensive school,Secondary school,University level degree,Master level degree
count,27.0,30.0,22.0,2.0
mean,2310.481481,2403.1,2887.227273,5147.0
std,473.108653,533.984395,1108.404384,110.308658
min,1638.0,1521.0,1559.0,5069.0
25%,2008.0,2008.25,2222.25,5108.0
50%,2144.0,2378.5,2710.0,5147.0
75%,2534.0,2729.0,2925.0,5186.0
max,3587.0,3510.0,6278.0,5225.0


## List of functions and features used in the notebook

- A __string__ is written between single quotes ' and ' (double quotes " and " also work). Strings include file names, dataframe column headers (data variable names) and textual values of variables. Note, however, that Python variable names are not written between quotes (for example, df, n, etc.)

- A library is imported with the __import__ command; as an example, importing the __pandas__ library.

- Python's data structure __list__, which is written inside square brackets
- Python's data structure __dictionary__, which is written inside curly brackets
- The Pandas library data structures __dataframe__ and __series__

- __pd.read_excel()__ reading data from an Excel file
- __df.shape__ the number of rows and columns of the data
- __df.info()__ the number of values and data types
- __df.describe()__ statistical key figures
- __df.nsmallest()__ n smallest with respect to the named column
- __df.nlargest()__ n largest with respect to the named column
- __df[].unique()__ an array of unique values for the named column
- __pd.crosstab()__ number and percentage summaries
- __df.columns__ column headers
- __df.index__ row headers
- __df.columns.name__ the name of the column headers
- __df.index.name__ the name of the row headers
- __df.rename()__ renaming column headers
- __df.reindex()__ rearranging column headers or index
- __df.reset_index()__ moving the index into a regular column (the default index is then used)
- __df.set_index()__ moving a column to the index
- __df.loc[]__ referring to the “slice” of the dataframe
- __df[].to_frame()__ converting a series into a dataframe
- __pd.cut()__ classification of values
- __df.groupby()__ grouping
- __df.T__ transposing the dataframe; i.e. swapping rows and columns

If you run into errors, it often helps to execute the code cells again with the __Restart & Run All__ command in the __Kernel__ menu. When opening a previously started notebook, the first thing to do is run the code cells with the __Restart & Run All__ command in the __Kernel__ menu.

Source and origin of inspiration:<br /> 
Aki Taanila: Data-analytiikka Pythonilla: <a href="https://tilastoapu.wordpress.com/python/">https://tilastoapu.wordpress.com/python/</a>

In [1]:
import datetime
print(f'Last modified {datetime.datetime.now():%Y-%m-%d %H:%M} by Juha Nurmonen')

Last modified 2023-03-28 21:44 by Juha Nurmonen
