# Numbers and percentages with value_counts()

Frequency tables and cross tabulations can be calculated with the function <strong>crosstab()</strong>. The method <strong>value_counts()</strong> is a flexible alternative for the same job. Some useful tricks for it are introduced below.

In [13]:
### Let's import the library <em>pandas</em> and call it pd

import pandas as pd

### Read the data into a dataframe called df

df = pd.read_excel('https://myy.haaga-helia.fi/~menetelmat/Data-analytiikka/Teaching/data1_en.xlsx')

### Let's add one variable with type object to the data

df['duties_obj'] = df['duties'].replace({1 : 'Very unsatisfied', 2 : 'Unsatisfied', 3: 'Not satisfied or unsatisfied', 4 : 'Satisfied', 5 : 'Very satisfied'})

### Now there are integer (int64), floating-point number (float64) and object type variables in the data. 

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82 entries, 0 to 81
Data columns (total 17 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   number        82 non-null     int64  
 1   sex           82 non-null     int64  
 2   age           82 non-null     int64  
 3   family        82 non-null     int64  
 4   education     81 non-null     float64
 5   empl_years    80 non-null     float64
 6   salary        82 non-null     int64  
 7   management    82 non-null     int64  
 8   colleagues    81 non-null     float64
 9   environment   82 non-null     int64  
 10  salary_level  82 non-null     int64  
 11  duties        82 non-null     int64  
 12  occu_health   47 non-null     float64
 13  timeshare     20 non-null     float64
 14  gym           9 non-null      float64
 15  massage       22 non-null     float64
 16  duties_obj    82 non-null     object 
dtypes: float64(7), int64(9), object(1)
memory usage: 11.0+ KB


## Ordering by frequency

In [14]:
### By default value_counts() sorts the frequencies in descending order

df['environment'].value_counts()

3    30
4    23
5    11
1     9
2     9
Name: environment, dtype: int64
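To see what value_counts() returns without downloading the course data, here is a minimal sketch on a small made-up Series. The name `ratings` and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical ratings on a 1-5 scale (not the course data).
ratings = pd.Series([3, 4, 3, 5, 3, 4], name='environment')

counts = ratings.value_counts()

# The result is a Series: the index holds the distinct values and
# the values hold their frequencies, most frequent first.
print(type(counts).__name__)
print(counts.index.tolist())
print(counts.tolist())
```

Because the distinct values end up in the index, the later tricks (sort_index(), reindex()) all operate on the index of this Series.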

In [15]:
### The output is a Series, not a dataframe, but it can be converted to one with the method <strong>to_frame()</strong>

df['environment'].value_counts().to_frame()

|  | environment |
|---|---|
| 3 | 30 |
| 4 | 23 |
| 5 | 11 |
| 1 | 9 |
| 2 | 9 |


In [16]:
### Setting the parameter ascending to True sorts the frequencies in ascending order

df['environment'].value_counts(ascending = True).to_frame()

|  | environment |
|---|---|
| 1 | 9 |
| 2 | 9 |
| 5 | 11 |
| 4 | 23 |
| 3 | 30 |
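One thing to be aware of: the column label produced by to_frame() depends on the pandas version. In the outputs shown here the column takes the name of the variable, whereas from pandas 2.0 onwards the counts are labelled `count`. The sketch below, on made-up data, passes an explicit (hypothetical) column name `n` so the result looks the same either way.

```python
import pandas as pd

# Hypothetical ratings, not the course data.
ratings = pd.Series([3, 4, 3, 5, 3, 4], name='environment')

# ascending=True puts the rarest value first; the name given to
# to_frame() fixes the column label explicitly.
table = ratings.value_counts(ascending=True).to_frame('n')

print(table)
```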


## Ordering by variable value

In [17]:
### sort_index() sorts the output by the values of the variable, in ascending order

df['environment'].value_counts().sort_index().to_frame()

|  | environment |
|---|---|
| 1 | 9 |
| 2 | 9 |
| 3 | 30 |
| 4 | 23 |
| 5 | 11 |


In [18]:
### The opposite ordering is also possible by setting the parameter ascending to False

df['environment'].value_counts().sort_index(ascending = False).to_frame()

|  | environment |
|---|---|
| 5 | 11 |
| 4 | 23 |
| 3 | 30 |
| 2 | 9 |
| 1 | 9 |
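The difference between the two kinds of ordering is easiest to see side by side. A minimal sketch on made-up values (the Series `ratings` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical ratings: 4 occurs three times, 2 twice, 5 once.
ratings = pd.Series([4, 2, 5, 4, 4, 2])

by_frequency = ratings.value_counts()                          # most frequent first
by_value = ratings.value_counts().sort_index()                 # smallest value first
by_value_desc = ratings.value_counts().sort_index(ascending=False)

print(by_frequency.index.tolist())
print(by_value.index.tolist())
print(by_value_desc.index.tolist())
```

Sorting by value is usually the better choice for ordinal scales, because the rows then follow the scale rather than the popularity of the answers.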


In [19]:
### sort_index() sorts object type variables alphabetically
### This is seldom what we want. An example:

df['duties_obj'].value_counts().sort_index().to_frame()

|  | duties_obj |
|---|---|
| Not satisfied or unsatisfied | 29 |
| Satisfied | 25 |
| Unsatisfied | 15 |
| Very satisfied | 8 |
| Very unsatisfied | 5 |


In [20]:
### The desired ordering is obtained with a list and the method reindex().
### Note: the list must contain exactly the values that occur in the variable. A label without a match gets NaN, and a value missing from the list is dropped.

satisfactions = ['Very unsatisfied', 'Unsatisfied', 'Not satisfied or unsatisfied', 'Satisfied', 'Very satisfied']

df['duties_obj'].value_counts().reindex(satisfactions).to_frame()

|  | duties_obj |
|---|---|
| Very unsatisfied | 5 |
| Unsatisfied | 15 |
| Not satisfied or unsatisfied | 29 |
| Satisfied | 25 |
| Very satisfied | 8 |
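The sketch below illustrates what happens when the list passed to reindex() contains a label that never occurs in the data, and shows one possible alternative: an ordered categorical, which carries its own ordering so that sort_index() no longer sorts alphabetically. The labels and data are made up for illustration, and the categorical approach is a suggestion, not part of the original notebook.

```python
import pandas as pd

# Hypothetical labels and answers, not the course data.
order = ['Low', 'Medium', 'High']
answers = pd.Series(['High', 'Low', 'High', 'Medium', 'High'])

# reindex() matches labels exactly; 'Very high' never occurs, so it
# would become NaN. fill_value=0 turns that into a zero count instead.
counts = answers.value_counts().reindex(order + ['Very high'], fill_value=0)

# Alternative: an ordered categorical remembers its category order,
# so sort_index() follows that order instead of the alphabet.
categorical = pd.Series(pd.Categorical(answers, categories=order, ordered=True))
counts_cat = categorical.value_counts().sort_index()

print(counts)
print(counts_cat)
```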


## Displaying missing values

In [21]:
### Missing values can also be displayed

df['education'].value_counts(dropna = False).sort_index().to_frame()

|  | education |
|---|---|
| 1.0 | 27 |
| 2.0 | 30 |
| 3.0 | 22 |
| 4.0 | 2 |
| NaN | 1 |
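A quick way to check the effect of dropna is to compare the totals with and without it. A minimal sketch on a made-up Series with one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical education codes with one missing answer.
education = pd.Series([1.0, 2.0, 2.0, np.nan, 3.0])

without_missing = education.value_counts()              # NaN is left out by default
with_missing = education.value_counts(dropna=False)     # NaN gets its own row

# The difference between the totals equals the number of missing values.
print(without_missing.sum(), with_missing.sum(), education.isna().sum())
```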


## Displaying percentages

In [22]:
### First we calculate frequencies of the variable education into the dataframe df1

df1 = df['education'].value_counts().sort_index().to_frame()

### Then we add a column for percentages.
### Setting the parameter normalize to True returns proportions; multiplying them by 100 gives percentages.

df1['%'] = df['education'].value_counts(normalize = True) * 100

### Next the numeric codes are replaced with text labels that people understand. See also the variable descriptions in the Excel file.
education = ['Comprehensive school level', 'Upper secondary education', 'Academic degree', 'Higher academic degree']
df1.index = education

### Add a row for total
df1.loc['Total'] = df1.sum()

### The dataframe df1 is now as displayed
df1

|  | education | % |
|---|---|---|
| Comprehensive school level | 27.0 | 33.333333 |
| Upper secondary education | 30.0 | 37.037037 |
| Academic degree | 22.0 | 27.160494 |
| Higher academic degree | 2.0 | 2.469136 |
| Total | 81.0 | 100.0 |
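The same count-and-percentage table can also be built in one step from two aligned Series. This is a sketch on made-up data; the column names `n` and `%` are chosen here for illustration.

```python
import pandas as pd

# Hypothetical education codes, not the course data.
education = pd.Series([1, 2, 2, 3, 1, 2, 2, 1])

counts = education.value_counts().sort_index()
shares = education.value_counts(normalize=True).sort_index() * 100

# Both Series share the same index, so they align row by row.
table = pd.DataFrame({'n': counts, '%': shares})
table.loc['Total'] = table.sum()

print(table.round(1))
```

Note that the missing values are excluded from the base, so the percentages are shares of the valid answers and always sum to 100.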


## Categorical distribution

In [24]:
### We first define boundaries between categories.

bins = [1500, 2000, 2500, 3000, 8000]

df2 = df['salary'].value_counts(bins = bins).sort_index().to_frame()

df2.loc['Total'] = df2.sum()

df2

|  | salary |
|---|---|
| (1499.999, 2000.0] | 19 |
| (2000.0, 2500.0] | 28 |
| (2500.0, 3000.0] | 22 |
| (3000.0, 8000.0] | 13 |
| Total | 82 |
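The interval labels produced by the parameter bins are exact but not very readable. One possible alternative, sketched here on made-up salaries, is to classify the values with pd.cut() and give the classes labels of our own. The labels below are hypothetical and assume whole-number salaries.

```python
import pandas as pd

# Hypothetical salaries, not the course data.
salaries = pd.Series([1500, 1800, 2000, 2400, 2600, 3200, 5000])
bins = [1500, 2000, 2500, 3000, 8000]
labels = ['1500-2000', '2001-2500', '2501-3000', '3001-8000']

# include_lowest=True puts the lowest boundary (1500) into the first
# class, matching what value_counts(bins=...) does.
binned = pd.cut(salaries, bins=bins, labels=labels, include_lowest=True)
counts = binned.value_counts().sort_index()

# Same frequencies as with the bins parameter, only the labels differ.
counts_bins = salaries.value_counts(bins=bins).sort_index()

print(counts)
```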


## Cross tabulation

In [27]:
### In cross tabulation the method groupby() can be utilized.
### In the code below, unstack() moves the values of the categorical variable sex to the columns.

df3 = df.groupby('sex')['education'].value_counts().sort_index().unstack('sex')

### Next we define the row labels.
df3.index = education
df3.loc['Total'] = df3.sum()

### Here we redefine the column labels.
df3.columns = ['Male', 'Female']

df3


|  | Male | Female |
|---|---|---|
| Comprehensive school level | 22.0 | 5.0 |
| Upper secondary education | 23.0 | 7.0 |
| Academic degree | 15.0 | 7.0 |
| Higher academic degree | 2.0 | NaN |
| Total | 62.0 | 19.0 |
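For comparison, the sketch below builds the same kind of table both with groupby() and value_counts() and with crosstab(), which was mentioned at the beginning. The small dataframe `people` is made up for illustration. Passing fill_value=0 to unstack() shows zeros instead of NaN for empty cells.

```python
import pandas as pd

# Hypothetical data: sex coded 1/2, education coded 1-3.
people = pd.DataFrame({
    'sex': [1, 1, 2, 1, 2, 1],
    'education': [1, 2, 2, 1, 3, 2],
})

# groupby + value_counts, with empty cells filled with 0.
via_groupby = (people.groupby('sex')['education']
               .value_counts()
               .unstack('sex', fill_value=0)
               .sort_index())

# crosstab gives the same frequencies directly.
via_crosstab = pd.crosstab(people['education'], people['sex'])

# margins=True adds row and column totals labelled 'All'.
with_totals = pd.crosstab(people['education'], people['sex'], margins=True)

print(via_groupby)
print(with_totals)
```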


## Frequencies of several variables with the same scale in one table

In [28]:
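### Proportions of the first variable.
### sort_index() guarantees the row order 1-5 that the text labels assigned below rely on;
### with sort = False alone, recent pandas versions list the values in order of first appearance.
### Naming the column explicitly keeps the loop at the end working in pandas 2.0 and later,
### where the column would otherwise be called 'proportion'.

df4 = df['management'].value_counts(normalize = True).sort_index().to_frame('management')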

### Next frequencies of other variables are added.

df4['colleagues'] = df['colleagues'].value_counts(sort = False, normalize = True)
df4['environment'] = df['environment'].value_counts(sort = False, normalize = True)
df4['salary_level'] = df['salary_level'].value_counts(sort = False, normalize = True)
df4['duties'] = df['duties'].value_counts(sort = False, normalize = True)

### We replace the numeric codes 1-5 with the text labels from the list satisfactions defined above.

df4.index = satisfactions

df4.loc['Total'] = df4.sum()

### The following loop appends the number of valid answers (n) of each variable to its column header

for var in df4.columns:
    df4 = df4.rename(columns = {var : var + ', n =' + str(df[var].count())})
    
df4 * 100


|  | management, n =82 | colleagues, n =81 | environment, n =82 | salary_level, n =82 | duties, n =82 |
|---|---|---|---|---|---|
| Very unsatisfied | 8.536585 | NaN | 10.97561 | 40.243902 | 6.097561 |
| Unsatisfied | 19.512195 | 3.703704 | 10.97561 | 23.170732 | 18.292683 |
| Not satisfied or unsatisfied | 36.585366 | 19.753086 | 36.585366 | 23.170732 | 35.365854 |
| Satisfied | 28.04878 | 43.209877 | 28.04878 | 12.195122 | 30.487805 |
| Very satisfied | 7.317073 | 33.333333 | 13.414634 | 1.219512 | 9.756098 |
| Total | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
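A more compact way to collect several distributions into one table is to build the dataframe from a dictionary of Series. This is only a sketch on made-up data (the dataframe `survey` and its columns are assumptions). The dictionary keys become the column names, and the Series are aligned on the answer codes, so a code that nobody chose shows up as NaN.

```python
import pandas as pd

# Hypothetical survey answers; None marks a missing answer.
survey = pd.DataFrame({
    'management': [1, 2, 2, 3, 3, 3],
    'colleagues': [2, 3, 3, 3, None, 2],
})

# One column of percentages per variable, rows sorted by answer code.
table = pd.DataFrame({
    col: survey[col].value_counts(normalize=True)
    for col in survey.columns
}).sort_index() * 100

# Append the number of valid answers to each column header.
table.columns = [f'{col}, n = {survey[col].count()}' for col in survey.columns]

print(table.round(1))
```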


## Further information

<ul>
    <li><a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html">
        https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html </a></li>    
</ul>

Source and origin of inspiration: <br />
Aki Taanila: Data-analytiikka Pythonilla: <a href="https://tilastoapu.wordpress.com/python/">https://tilastoapu.wordpress.com/python/</a>