# **Introduction to Python. Day 3**

## *Dr Kirils Makarovs*

## *k.makarovs@exeter.ac.uk*

## *University of Exeter Q-Step Centre*

---


# **Welcome to Day 3!**

## **By now, we have covered:**

+ The overall workflow of Jupyter Notebooks in Google Colab
+ The basics of Python syntax and operations with lists
+ How to navigate datasets - subsetting, accessing rows/columns
+ Operations with variables, i.e. recoding and creating new variables


## **Today, we are going to look at exploratory data analysis, i.e.:**

+ How to read in external datasets
+ Frequency tables, crosstabs
+ Descriptive statistics for numeric variables
+ Aggregated statistics

---



# **Preparing to work in Python**

In [None]:
# Import the necessary libraries

import pandas as pd # data analysis and management library
import numpy as np # multi-dimensional arrays


---

# **1. Reading in external datasets**

<figure>
<left>
<img src=https://miro.medium.com/max/481/1*n_ms1q5YoHAQXXUIfeADKQ.png  width="450">
</figure>

## **Loading dataframes from external sources**

With *Pandas*, you can open datasets that are stored in a great variety of formats

(just start typing `pd.read_` in a code cell and you'll see all the possible options)

Here is a list of the most common data formats and the commands to open them:

| Data format | Explanation | Command |
| ------- | -------- | --- |
| .csv | Comma-separated values | `pd.read_csv` |
| .xls / .xlsx | Excel spreadsheet | `pd.read_excel` |
| .dta | Stata data file format | `pd.read_stata` |
| .sav | SPSS data file format | `pd.read_spss` |

This is what a typical dataset looks like:

<figure>
<left>
<img src=https://media.geeksforgeeks.org/wp-content/uploads/finallpandas.png width="550">
</figure>

**[Image source](https://www.geeksforgeeks.org/python-pandas-dataframe/)**


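The reader functions in the table above all follow the same pattern: point them at a source and get a dataframe back. As a minimal sketch, the cell below reads a tiny made-up CSV held in memory instead of a file on disk. The names `csv_text` and `toy` and all the values are invented for illustration and are not part of the course data.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk (not the course dataset)
csv_text = """Name,Type 1,HP,Legendary
Sprout,Grass,45,False
Ember,Fire,39,False
Tide,Water,44,False
Aurora,Ice,90,True
"""

# pd.read_csv accepts a file path or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)   # (number of rows, number of columns)
print(toy.dtypes)  # column types inferred by pandas
```

With a real file, you would pass its name as a string instead of the `io.StringIO` object.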


In [None]:
# Let's upload the Pokemon dataset into the current Google Colab session

from google.colab import files

uploaded = files.upload()


In [None]:
# Here is an example of how to open a .csv file as a dataframe using the Pandas library

df = pd.read_csv('pokemon.csv')

df

# Data source: https://gist.github.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6


---

# **2. Frequency tables and crosstabs**



In [None]:
# Exploring the dataset

df.shape # number of rows and columns

df.info() # information about the variables

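A notebook cell automatically displays only the value of its last expression, so earlier lines such as `df.shape` stay silent unless they are printed. The sketch below shows one way to see both results in a single cell. The `toy` dataframe is made up for illustration.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    'Name': ['Sprout', 'Ember', 'Tide'],
    'HP': [45, 39, 44],
})

# .shape is an attribute holding a (rows, columns) tuple, so no brackets are needed
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# .info() prints its summary itself, so it shows up wherever it sits in the cell
toy.info()
```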

## **Frequency tables**

In [None]:
# Let's first get the frequency table for the 'Type 1' variable.
# We can use pandas .value_counts() method for this

df['Type 1'].value_counts() # sorts in descending order by default

df['Type 1'].value_counts(ascending = True) # sort in ascending order

df['Type 1'].value_counts(normalize = True) # get proportions instead of frequencies

df['Type 1'].value_counts(normalize = True).round(2) * 100 # round proportions to two decimals and multiply by 100 to get percentages

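The order of rounding and multiplying matters if you want decimals in your percentages. Here is a small sketch on a made-up series (`types` is invented for illustration) that multiplies first and rounds afterwards, which keeps one decimal place in the result.

```python
import pandas as pd

# Made-up categories for illustration only
types = pd.Series(['Water', 'Fire', 'Water', 'Grass', 'Water', 'Fire'])

counts = types.value_counts()                # frequencies, largest first
shares = types.value_counts(normalize=True)  # proportions that add up to 1
percent = (shares * 100).round(1)            # multiply first, then round

print(counts)
print(percent)
```

Rounding the proportions to two decimals before multiplying, as in the cell above, gives whole-number percentages instead.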

## **Crosstabs**

In [None]:
# You can get a crosstab in pandas via the pd.crosstab() function

# Crosstab of Generation and Legendary variables

pd.crosstab(df['Generation'], df['Legendary'])

pd.crosstab(df['Generation'], df['Legendary'], margins = True) # Add row/column margins (subtotals)

pd.crosstab(df['Generation'], df['Legendary'],
            margins = True, normalize = 'index') # row proportions, assuming that Generation affects Legendary status

pd.crosstab(df['Generation'], df['Legendary'],
            margins = True, normalize = 'columns') # column proportions, assuming that Legendary status affects Pokemon's generation

# Round the numbers and multiply by 100

pd.crosstab(df['Generation'], df['Legendary'],
            margins = True, normalize = 'index').round(2) * 100

# Would you say that Pokemon's generation affects its Legendary status?

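To check which way a normalised crosstab reads, it helps to add up its rows and columns. The sketch below uses a made-up dataframe (`toy`, with a text-coded `Legendary` column invented for illustration) and shows that `normalize='index'` makes each row sum to 1, while `normalize='columns'` does the same for each column.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    'Generation': [1, 1, 1, 2, 2, 2, 2, 2],
    'Legendary': ['no', 'no', 'yes', 'no', 'no', 'no', 'no', 'yes'],
})

counts = pd.crosstab(toy['Generation'], toy['Legendary'])
row_props = pd.crosstab(toy['Generation'], toy['Legendary'], normalize='index')
col_props = pd.crosstab(toy['Generation'], toy['Legendary'], normalize='columns')

print(counts)
print(row_props.sum(axis=1))  # each row adds up to 1
print(col_props.sum(axis=0))  # each column adds up to 1
```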

---

# **3. Descriptive statistics for numeric variables**


In [None]:
# Let's ask pandas to display all floating-point numbers with two decimal places

pd.set_option("display.precision", 2)

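The `display.precision` option changes only how numbers are shown, not the values stored in the dataframe. A minimal sketch on a made-up series (`s` is invented for illustration), using `pd.option_context` so that the setting applies only inside the `with` block:

```python
import pandas as pd

# Made-up values for illustration only
s = pd.Series([1 / 3, 2 / 3])

# option_context applies the setting only inside the with-block
with pd.option_context('display.precision', 2):
    shown = repr(s)  # the text a notebook would display

print(shown)
print(s.iloc[0])     # the stored value keeps its full precision
```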

In [None]:
# If you run the .describe() method on the entire dataframe, it will pick out the numeric variables
# and provide the following statistics: count, mean, std, min, max, the 25th and 75th percentiles, and the median (50th percentile)

df.describe()

# You can transpose the matrix if you wish

df.describe().transpose()

# Works on separate columns too:

df['HP'].describe()


In [None]:
# This is how you can get all these statistics separately:

df['HP'].count() # counts non-missing cells

df['HP'].mean() # mean value

df['HP'].std() # standard deviation

df['HP'].min() # minimum value

df['HP'].max() # maximum value

df['HP'].median() # median value

df['HP'].quantile(q = [0.25, 0.75]) # 25% and 75% quantiles


In [None]:
# Works for more than one variable too

df[['Attack', 'Defense']].mean()

# And for more than one statistic as well, but in this case you need to use .agg() method
# Statistics are supplied as a list and each one of them should be in speech marks

df[['Attack']].agg(['count', 'mean', 'std', 'median'])

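Two details of these statistics are easy to verify on a small example: the `50%` row of `.describe()` is the median, and `.std()` uses the sample formula (dividing by n - 1) unless you pass `ddof=0`. The series `hp` below is made up for illustration.

```python
import pandas as pd

# Made-up values for illustration only
hp = pd.Series([45, 60, 80, 39, 58, 78], name='HP')

summary = hp.describe()
picked = hp.agg(['count', 'mean', 'std', 'median'])

# The '50%' row of .describe() is the median
print(summary['50%'], hp.median())

# .std() divides by n - 1 by default; ddof=0 divides by n instead
print(hp.std(), hp.std(ddof=0))

print(picked)
```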

---

# **4. Aggregated statistics**

<figure>
<left>
<img src=https://miro.medium.com/max/1400/0*XVlrOuSBNKwIZpPj.png width="500">
</figure>

**[Image source](https://towardsdatascience.com/7-pandas-functions-that-i-use-the-most-b83ddbaf53bf)**

In [None]:
# Aggregated analysis implies that you get some statistic (e.g. mean or median) of your main variable
# separately for groups of observations defined by some other variable.

# (also known as the split-apply-combine technique)

# Your main variable should be continuous, whereas the grouping variable should be categorical.

# For example:

# mean income for men and women
# median level of life satisfaction for young, middle-aged, and elderly people
# 25% and 75% percentiles of anxiety scale for 1st, 2nd, and 3rd-year students

# The process of getting aggregated statistics requires two steps:
# You first group your data via .groupby() method,
# and then get aggregated values

# Getting the mean HP level for legendary and non-legendary Pokemon

df.groupby('Legendary')['HP'].mean()

# Getting the median Attack level per Pokemon generation

df.groupby('Generation')['Attack'].median()

# You can get a statistic for more than one outcome variable

df.groupby(['Generation'])[['Attack', 'Defense']].median()

# It is also possible to get more than one statistic simultaneously,
# but then they need to be wrapped up in the .agg() method

# Let's say that for each generation of Pokemon, I want to know
# the mean and standard deviation of their HP, Attack, and Defense points

df.groupby('Generation')[['HP', 'Attack', 'Defense']].agg(['mean', 'std'])

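To see what split-apply-combine does under the hood, the sketch below performs the three steps by hand and compares the result with `.groupby()`. The `toy` dataframe and its text-coded `Legendary` column are made up for illustration.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    'Legendary': ['no', 'no', 'yes', 'no', 'yes'],
    'HP': [45, 60, 100, 39, 90],
})

# Split the rows by group, apply the mean to each piece, combine the results
manual = {}
for label in toy['Legendary'].unique():
    piece = toy[toy['Legendary'] == label]  # split
    manual[label] = piece['HP'].mean()      # apply
manual = pd.Series(manual)                  # combine

# The same result in one line with .groupby()
grouped = toy.groupby('Legendary')['HP'].mean()

print(manual)
print(grouped)
```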

## **Exercise**

Alright, it's time to practice!


In [None]:
# Let's upload the mtcars dataset into the current Google Colab session

from google.colab import files

uploaded = files.upload()


In [None]:
# And save it as an mtcars object

mtcars = pd.read_csv('mtcars.csv')

mtcars.head(10)


In [None]:
# Here is the description of the dataset

# Motor Trend Car Road Tests data
# The data was extracted from the 1974 Motor Trend US magazine,
# and comprises fuel consumption and 10 aspects of automobile design and performance
# for 32 automobiles (1973–74 models).

# mpg - Miles/(US) gallon
# cyl - Number of cylinders
# disp - Displacement (cu.in.)
# hp - Gross horsepower
# drat - Rear axle ratio
# wt - Weight (1000 lbs)
# qsec - 1/4 mile time
# vs - Engine (0 = V-shaped, 1 = straight)
# am - Transmission (0 = automatic, 1 = manual)
# gear - Number of forward gears
# carb - Number of carburetors


Using the `mtcars` dataset, please answer the following questions:

+ How many cars have **8 cylinders**?
+ Among all cars, what is the proportion of those that have a **V-shaped engine**?
+ Among all cars, what is the proportion of those having **4 gears**?
+ Among cars with **automatic transmission**, how many of them have a **straight engine**?
+ Among cars with **automatic transmission**, what is the *proportion* of those having a **straight engine**?
+ Among cars with a **straight engine**, what is the *proportion* of those having an **automatic transmission**?
+ Among all cars, what is the **mean 1/4 mile time**?
+ Is the **median number of forward gears** greater than the **median number of carburetors**? What about the **mean value**?
+ Get the **mean value of miles per gallon** for cars with different **numbers of carburetors**. Which group has the highest average mileage?
+ Get the **standard deviation of weight** for cars with different **transmission** and **engine** types. Which group has the highest variation in weight?
+ Get the **mean and median values of displacement** for cars with different **numbers of cylinders**. For which group or groups is the median higher than the mean?




In [None]:
# How many cars have 8 cylinders?

mtcars['cyl'].value_counts()


In [None]:
# Among all cars, what is the proportion of those that have a V-shaped engine?

mtcars['vs'].value_counts(normalize = True)


In [None]:
# Among all cars, what is the proportion of those having 4 gears?

mtcars['gear'].value_counts(normalize = True)


In [None]:
# Among cars with automatic transmission, how many of them have a straight engine?

pd.crosstab(mtcars['am'], mtcars['vs'])


In [None]:
# Among cars with automatic transmission, what is the proportion of those having a straight engine?

pd.crosstab(mtcars['am'], mtcars['vs'], normalize = 'index')


In [None]:
# Among cars with a straight engine, what is the proportion of those having an automatic transmission?

pd.crosstab(mtcars['am'], mtcars['vs'], normalize = 'columns')


In [None]:
# Among all cars, what is the mean 1/4 mile time?

mtcars['qsec'].mean()


In [None]:
# Is the median number of forward gears greater than the median number of carburetors? What about the mean value?

mtcars[['gear', 'carb']].median()

mtcars[['gear', 'carb']].mean()


In [None]:
# Get the mean value of miles per gallon for cars with different numbers of carburetors.
# Which group has the highest average mileage?

mtcars.groupby('carb')['mpg'].mean()


In [None]:
# Get the standard deviation of weight for cars with different transmission and engine types.
# Which group has the highest variation in weight?

mtcars.groupby(['vs', 'am'])['wt'].std()


In [None]:
# Get the mean and median values of displacement for cars with different numbers of cylinders.
# For which group or groups is the median higher than the mean?

mtcars.groupby('cyl')['disp'].agg(['mean', 'median'])


# **That's the end of Day 3!**
