# **Introduction to Python. Day 3**

## *Dr Kirils Makarovs*

## *k.makarovs@exeter.ac.uk*

## *University of Exeter Q-Step Centre*

---


# **Welcome to Day 3!**

## **By now, you should be familiar with:**

+ The overall workflow of Jupyter Notebooks in Google Colab
+ The basics of Python syntax and operations with lists
+ How to read in external datasets
+ How to navigate datasets - subsetting, accessing rows/columns
+ Operations with variables i.e. recoding, creating new variables


## **Today, we are going to look at exploratory data analysis, i.e.:**

+ Frequency tables, crosstabs
+ Descriptive statistics for numeric variables
+ Aggregated statistics

---



# **1. Preparing to work in Python**

In [None]:
# Import the necessary libraries

import pandas as pd # data analysis and management library
import numpy as np # multi-dimensional arrays
import math # library with math-related commands like square root, etc.
import random # random number generator via random.sample()


In [None]:
# Mount your Google Drive

# Mounting your Google Drive will enable you to access files from Drive in Google Colab e.g. datasets, notebooks, etc.

from google.colab import drive

# This will prompt for authorization. Enter your authorization code and rerun the cell

drive.mount('/content/drive')


---

# **2. Random helpful commands**



In [None]:
# Getting help for Python commands

# There are several ways to look up what pandas/numpy functions and methods do, as well as to see their arguments

# 1) Google it (seriously)
# 2) Type and run '? <your command here>' e.g. ? pd.read_csv
# 3) Start typing the command and then hover the mouse over it


In [None]:
# Let's create a tiny dataset

df = pd.DataFrame(data = {'col1' : [1,2,3],
                          'col2' : [4,5,6],
                          'col3' : [7,8,9],
                          'col4' : [10,11,12]})

df


In [None]:
# Renaming the variables (or rows)

# You can rename columns or rows with the pandas .rename() method

df.rename(columns = {'col1' : 'column_1',
                     'col2' : 'column_2'}) # old_name : new_name

# You can change the names of indices by using 'index =' instead of 'columns ='

# Note the 'inplace' argument! It can be either True or False (default is False)
# If False, .rename() returns a new DataFrame and leaves df unchanged. If True, df itself is modified and nothing is returned.

df.rename(columns = {'col1' : 'column_1',
                     'col2' : 'column_2'},
          inplace = True) # old_name : new_name

df
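
To see what `inplace` changes in practice, here is a minimal sketch on a throwaway dataset. The names `demo`, `renamed` and `result` are made up for this illustration so that nothing overwrites `df`.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration
demo = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})

# Without inplace, .rename() returns a renamed copy and leaves demo untouched
renamed = demo.rename(columns={'col1': 'column_1'})
print(list(demo.columns))     # still the old names
print(list(renamed.columns))  # the new name lives in the returned copy

# With inplace=True, demo itself is modified and the call returns None
result = demo.rename(columns={'col1': 'column_1'}, inplace=True)
print(result)
print(list(demo.columns))

# Rows are renamed the same way through the index argument
print(demo.rename(index={0: 'first_row'}))
```

A common slip is writing `df = df.rename(..., inplace = True)`, which would replace `df` with `None`.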


In [None]:
# Dropping the variables (or rows)

df.drop(labels = ['col3', 'col4'], axis = 1, inplace = True)

# axis can be 0 or 1 (default is 0)
# 0 or 'index' drops rows by their index label, 1 or 'columns' drops columns

df
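
As a small sketch of the two directions of `.drop()`, the example below uses a made-up dataset called `toy`. It also shows the `columns =` keyword, which can be used instead of `labels =` together with `axis = 1`.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# columns=... is shorthand for labels=... together with axis=1
no_c = toy.drop(columns=['c'])

# axis=0 (the default) drops rows by their index label instead
no_first_row = toy.drop(labels=[0], axis=0)

print(no_c)
print(no_first_row)
print(toy.shape)  # toy is unchanged because inplace was not used
```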


In [None]:
# Replacing values in a dataframe using pandas .replace() method

df.replace(to_replace = [1,2,3], value = [1000, 2000, 3000], inplace = True)

df
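
Besides two matching lists, `.replace()` also accepts a dictionary. The sketch below uses a made-up dataset called `toy`. It contrasts replacing a value everywhere with replacing it in one column only.

```python
import pandas as pd

# Hypothetical toy data: the value 1 appears in both columns
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 5, 6]})

# A dictionary pairs each old value with its replacement (old_value : new_value)
everywhere = toy.replace({1: 1000, 2: 2000})

# A nested dictionary limits the replacement to one column (column : {old_value : new_value})
only_in_a = toy.replace({'a': {1: 1000}})

print(everywhere)
print(only_in_a)
```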


---

# **3. Frequency tables and crosstabs**

<figure>
<left>
<img src=https://miro.medium.com/max/481/1*n_ms1q5YoHAQXXUIfeADKQ.png  width="450">
</figure>

In [None]:
# Let's get the dataset first!

# Don't forget that the path to your file will be different from mine
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Intro_to_python/Day_3/pokemon.csv')

df.head()

# Data source: https://gist.github.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6


In [None]:
df.shape # dimensions of the dataset

df.info()


## **Frequency tables**

In [None]:
# Let's first get the frequency table for the 'Type 1' variable.
# We can use pandas .value_counts() method for this

df['Type 1'].value_counts() # sorts in descending order by default

df['Type 1'].value_counts(ascending = True) # sort in ascending order

df['Type 1'].value_counts(sort = False) # don't sort by frequency

df['Type 1'].value_counts(normalize = True) # get proportions instead of frequencies

df['Type 1'].value_counts(normalize = True).round(2) * 100 # round proportions to two decimals and multiply by 100 to get percentage

(df['Type 1'].value_counts(normalize = True).round(2) * 100).sum() # 100% indeed
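
It is often handy to see counts and percentages side by side. The sketch below does this for a made-up variable called `colours`, not the Pokemon data. It multiplies by 100 before rounding, so the percentages keep two decimals. Rounded percentages in general may add up to slightly more or less than 100.

```python
import pandas as pd

# Hypothetical toy variable, used only for this illustration
colours = pd.Series(['red', 'blue', 'red', 'green', 'red', 'blue'])

counts = colours.value_counts()                # absolute frequencies
shares = colours.value_counts(normalize=True)  # proportions that sum to 1

# Combine both into one frequency table; the two Series share the same index
freq_table = pd.DataFrame({'n': counts,
                           'percent': (shares * 100).round(2)})
print(freq_table)
```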


In [None]:
# If you pass more than 1 variable to .value_counts(),
# you get the number of observations in each combination of values

# Note that if you pass more than one variable, they should go in a list, i.e. ['Type 1', 'Legendary']

df[['Type 1', 'Legendary']].value_counts()

# Same number of observations, 800 pokemons in total

df['Type 1'].value_counts().sum() == df[['Type 1', 'Legendary']].value_counts().sum()


In [None]:
# Note that the strength of Python, and of basically any programming language, is that instead of
# writing the same lines of code for repetitive operations, e.g. showing frequency tables
# for more than one variable, you can make Python do it for you

# We don't cover this in the course because it's more advanced, but just so you know:
# if you want to show frequency tables for two variables, instead of writing:

# df['Type 1'].value_counts()
# df['Legendary'].value_counts()

# You can either use the .apply() method and apply value_counts() to
# two or more variables in the following way:

df[['Type 1', 'Legendary']].apply(pd.Series.value_counts)

# Or use a for loop, in which for each variable you print out in turn:
# A line with variable's name
# A frequency table
# A line break

for variable in df[['Type 1', 'Legendary']]:
  print(f'Frequency table for {variable} variable')
  print(df[variable].value_counts())
  print('\n') # \n denotes a line break in Python


## **Crosstabs**

In [None]:
# You can get a crosstab in pandas via the pd.crosstab() function

# Crosstab of Generation and Legendary variables

pd.crosstab(df['Generation'], df['Legendary'])

pd.crosstab(df['Generation'], df['Legendary'], margins = True) # Add row/column margins (subtotals)

pd.crosstab(df['Generation'], df['Legendary'],
            margins = True, normalize = 'index') # row proportions, assuming that Generation affects Legendary status

pd.crosstab(df['Generation'], df['Legendary'],
            margins = True, normalize = 'columns') # column proportions, assuming that Legendary status affects Pokemon's generation

# Round the numbers and multiply by 100

# Would you say that Pokemon's generation affects its Legendary status?

pd.crosstab(df['Generation'], df['Legendary'],
            margins = True, normalize = 'index').round(2) * 100
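
If it is hard to remember which way `normalize` works, a quick check on a tiny made-up dataset can help. In the sketch below (`toy`, `group` and `passed` are hypothetical names), `'index'` makes each row add up to 1 and `'columns'` makes each column add up to 1.

```python
import pandas as pd

# Hypothetical toy data: two categorical variables
toy = pd.DataFrame({'group': ['a', 'a', 'a', 'b', 'b'],
                    'passed': ['yes', 'no', 'yes', 'yes', 'yes']})

row_props = pd.crosstab(toy['group'], toy['passed'], normalize='index')
col_props = pd.crosstab(toy['group'], toy['passed'], normalize='columns')

print(row_props)
print(row_props.sum(axis=1))  # each row adds up to 1

print(col_props)
print(col_props.sum(axis=0))  # each column adds up to 1
```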


---

# **3. Descriptive statistics for numeric variables**


In [None]:
# Let's ask pandas to display all numbers with two decimals (this changes the display only, not the stored values)

pd.set_option("display.precision", 2)
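
The display option and the `.round()` method do different things. The sketch below uses a made-up Series called `values`: the option only affects how numbers are printed, while `.round()` returns new, rounded values.

```python
import pandas as pd

pd.set_option('display.precision', 2)

# Hypothetical values, used only for this illustration
values = pd.Series([1.23456, 2.34567])

print(values)          # printed with two decimals
print(values.iloc[0])  # the stored value still has all its digits

rounded = values.round(2)  # .round() actually changes the values
print(rounded.iloc[0])
```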


In [None]:
# If you run the .describe() method on the entire dataframe, it will pick out the numeric variables
# and provide the following statistics: count, mean, std, min, max, 25th and 75th percentiles, and median (50th percentile)

df.describe()

# You can transpose the matrix if you wish

df.describe().transpose()

# Works on separate columns too:

df['HP'].describe()
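
If the default quartiles are not the ones you need, `.describe()` takes a `percentiles` argument. This is a minimal sketch on a made-up Series called `scores`.

```python
import pandas as pd

# Hypothetical values, used only for this illustration
scores = pd.Series([10, 20, 30, 40, 50])

# Ask for the 10th and 90th percentiles instead of the default 25th and 75th
summary = scores.describe(percentiles=[0.1, 0.9])
print(summary)
```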


In [None]:
# This is how you can get all these statistics separately:

df['HP'].count() # counts non-missing cells

df['HP'].mean() # mean value

df['HP'].std() # standard deviation

df['HP'].min() # minimum value

df['HP'].max() # maximum value

df['HP'].median() # median value

df['HP'].quantile(q = [0.25, 0.75]) # 25% and 75% quantiles
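
One detail worth knowing about `.std()`: by default pandas computes the sample standard deviation (dividing by n - 1), whereas NumPy's `np.std()` divides by n unless told otherwise. The sketch below compares the two on made-up numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical values, used only for this illustration
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

sample_sd = values.std()              # pandas default: divides by n - 1 (ddof=1)
population_sd = values.std(ddof=0)    # divides by n
numpy_sd = np.std(values.to_numpy())  # NumPy default: divides by n (ddof=0)

print(sample_sd, population_sd, numpy_sd)
```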


In [None]:
# Works for more than one variable too

df[['Attack', 'Defense']].mean()

# And for more than one statistic as well, but in this case you need to use .agg() method
# Statistics are supplied as a list and each one of them should be in speech marks

df[['Attack', 'Defense']].agg(['count', 'mean', 'std', 'median'])
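
`.agg()` can also take a dictionary when you want a different statistic for each variable. This is a small sketch on a made-up dataset (`toy`, `attack` and `defense` are hypothetical names, not the Pokemon columns).

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration
toy = pd.DataFrame({'attack': [10, 20, 30],
                    'defense': [5, 15, 40]})

# variable : statistic
mixed = toy.agg({'attack': 'mean', 'defense': 'median'})
print(mixed)
```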


---

# **4. Aggregated statistics**

<figure>
<left>
<img src=https://miro.medium.com/max/1400/0*XVlrOuSBNKwIZpPj.png width="500">
</figure>

**[Image source](https://towardsdatascience.com/7-pandas-functions-that-i-use-the-most-b83ddbaf53bf)**

In [None]:
# Aggregated analysis implies that you get some statistic (e.g. mean or median) of your main variable
# separately for groups of observations defined by some other variable.

# (also known as Split-Apply-Combine technique)

# Your main variable should be continuous, whereas the grouping variable should be categorical.

# For example:

# mean income for men and women
# median level of life satisfaction for young, middle-aged, and elderly people
# 25th and 75th percentiles of an anxiety scale for 1st, 2nd, and 3rd-year students

# The process of getting aggregated statistics requires two steps:
# You first group your data via .groupby() method,
# and then get aggregated values

# Getting the mean HP level for legendary and non-legendary pokemons

df.groupby('Legendary')['HP'].mean()

# Getting the median Attack level per Pokemon generation

df.groupby('Generation')['Attack'].median()

# You can group by more than one variable:

df.groupby(['Generation', 'Legendary'])['Attack'].median()

# And get statistic for more than one outcome variable

df.groupby(['Generation', 'Legendary'])[['Attack', 'Defense']].median()

# It is also possible to get more than one statistic simultaneously,
# but then they need to be wrapped up in the .agg() method

# Let's say that for each generation of Pokemons, I want to know
# the mean, median, and standard deviation of their HP, Attack, and Defense points

df.groupby('Generation')[['HP', 'Attack', 'Defense']].agg(['mean', 'median', 'std'])
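
To make the Split-Apply-Combine idea more concrete, the sketch below does the three steps by hand on a made-up dataset (`toy`, `group` and `value` are hypothetical names). It then shows that `.groupby()` gives the same result in one line. It is only an illustration of the idea, not how pandas works internally.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration
toy = pd.DataFrame({'group': ['a', 'a', 'b', 'b', 'b'],
                    'value': [1, 3, 10, 20, 30]})

# Split-Apply-Combine by hand
manual = {}
for name in toy['group'].unique():                   # split: one subset per group
    subset = toy.loc[toy['group'] == name, 'value']
    manual[name] = subset.mean()                     # apply: one statistic per subset
manual = pd.Series(manual)                           # combine: collect the results

# The same result in one line
grouped = toy.groupby('group')['value'].mean()

print(manual)
print(grouped)
```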


## **Exercise**

Alright, it's time to practice!


In [None]:
# Let's get the mtcars dataset first

# Don't forget that the path to your file will be different from mine
mtcars = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Intro_to_python/Day_2/mtcars.csv')

mtcars.head(15)

# Motor Trend Car Road Tests data
# The data was extracted from the 1974 Motor Trend US magazine,
# and comprises fuel consumption and 10 aspects of automobile design and performance
# for 32 automobiles (1973–74 models).

# mpg - Miles/(US) gallon
# cyl - Number of cylinders
# disp - Displacement (cu.in.)
# hp - Gross horsepower
# drat - Rear axle ratio
# wt - Weight (1000 lbs)
# qsec - 1/4 mile time
# vs - Engine (0 = V-shaped, 1 = straight)
# am - Transmission (0 = automatic, 1 = manual)
# gear - Number of forward gears
# carb - Number of carburetors


Using the `mtcars` dataset, please answer the following questions:

+ How many cars have **8 cylinders**?
+ Among all cars, what is the proportion of those having **4 gears**?
+ Among all cars, what is the proportion of those having **more than 1 carburetor**?
+ Among cars with **automatic transmission**, how many of them have a **straight engine**?
+ Among cars with **6 cylinders**, what is the proportion of those having **gross horsepower higher than 150**?
+ Among all cars, what is the **mean 1/4 mile time**?
+ If we take only those cars that **weigh more than 75% of other cars** and have a **V-shaped engine**, what would be their **mean 1/4 mile time**?
+ Get the **mean value of miles per gallon** for cars with different **numbers of carburetors**. Which group has the highest average mileage?
+ Get the **mean and median values of displacement** for cars with different **numbers of cylinders**. For which group or groups is the median higher than the mean?
+ Finally, say we're interested in comparing **Mercedes vs all other cars**. Is it true that for Mercedes the **mean 1/4 mile time** is always higher than for other cars, regardless of how many **gears** they have?

In [None]:
mtcars.head()

In [None]:
# How many cars have 8 cylinders?

mtcars['cyl'].value_counts()


In [None]:
# Among all cars, what is the proportion of those having 4 gears?

mtcars['gear'].value_counts(normalize = True)


In [None]:
# Among all cars, what is the proportion of those having more than 1 carburetor?

# Approach #1

mtcars['carb_rec'] = np.where(mtcars['carb'] > 1, True, False) # creating a new variable for the condition

mtcars['carb_rec'].value_counts(normalize = True) # getting its frequency table

mtcars.drop(['carb_rec'], axis = 1, inplace = True) # dropping the variable

# Approach #2

(mtcars['carb'] > 1).sum() / len(mtcars['carb']) # how many cars have more than 1 carb / how many cars are there in total
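
A third option, shown here as a sketch on made-up numbers rather than the real `carb` column, is to take the mean of the True/False condition directly. Since True counts as 1 and False as 0, the mean of the condition is the proportion of True values.

```python
import pandas as pd

# Hypothetical values, used only for this illustration
carbs = pd.Series([1, 2, 4, 1, 2])

share_by_division = (carbs > 1).sum() / len(carbs)  # count of True / number of rows
share_by_mean = (carbs > 1).mean()                  # same thing in one step

print(share_by_division, share_by_mean)
```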


In [None]:
# Among cars with automatic transmission, how many of them have a straight engine?

pd.crosstab(mtcars['am'], mtcars['vs'])


In [None]:
# Among cars with 6 cylinders, what is the proportion of those having gross horsepower higher than 150?

# Approach #1

mtcars['hp_rec'] = np.where(mtcars['hp'] > 150, True, False) # creating a new variable for the condition


pd.crosstab(mtcars['cyl'], mtcars['hp_rec'], normalize = 'index') # crosstab with proportions by index (cyl variable)

# Approach #2

((mtcars['cyl'] == 6) & (mtcars['hp'] > 150)).sum() / (mtcars['cyl'] == 6).sum()
# how many 6-cylinder cars satisfy the condition / how many 6-cylinder cars there are in total


In [None]:
# Among all cars, what is the mean 1/4 mile time?

mtcars['qsec'].mean()


In [None]:
# If we take only those cars that weigh more than 75% of other cars and have a V-shaped engine,
# what would be their mean 1/4 mile time?

# Let's first see what the weight value of the 3rd quartile (75th percentile) is

weight_75 = mtcars['wt'].quantile(q = 0.75)

weight_75

# Now let's create a new variable testing whether a car weighs more than the 3rd quartile and has a V-shaped engine

mtcars['cond'] = np.where((mtcars['wt'] > weight_75) & (mtcars['vs'] == 0), 'Satisfy', 'Does not satisfy')

mtcars['cond'].value_counts()

# Now select only those cars that satisfy the condition and calculate their mean 1/4 mile time

mtcars.loc[mtcars['cond'] == 'Satisfy', 'qsec'].mean()

mtcars.drop(['cond'], axis = 1, inplace = True)

# Same but all in one command

mtcars.loc[(mtcars['wt'] > weight_75) & (mtcars['vs'] == 0), 'qsec'].mean()


In [None]:
# Get the mean value of miles per gallon for cars with different number of carburetors. Which group has the highest mileage?

mtcars.groupby('carb')['mpg'].mean()


In [None]:
# Get the mean and median values of displacement for cars with different number of cylinders.
# For which group or groups is the median higher than the mean?

mtcars.groupby('cyl')['disp'].agg(['mean', 'median'])


In [None]:
# Finally, say we're interested in comparing Mercedes vs all other cars.
# Is it true that for Mercedes the mean 1/4 mile time is always higher than for other cars, regardless of how many gears they have?

# Run this line to create a new binary variable for Mercedes

mtcars['is_merc'] = np.where(mtcars['model'].str.contains('Merc'), True, False)

mtcars.groupby(['is_merc', 'gear'])['qsec'].mean()
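
A short sketch of what `.str.contains()` returns, using a few example labels in a made-up Series called `models`. The result is already a True/False Series, and the match is case-sensitive unless `case = False` is passed.

```python
import pandas as pd

# Example labels, used only for this illustration
models = pd.Series(['Merc 240D', 'Mazda RX4', 'Merc 450SL', 'Fiat 128'])

is_merc = models.str.contains('Merc')                    # True where the text contains 'Merc'
is_merc_any_case = models.str.contains('merc', case=False)  # ignore upper/lower case

print(is_merc)
print(is_merc_any_case)
```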


# **That's the end of Day 3!**
