## Section 1: Variable Types

Let's investigate the data in `cars.csv`.

In [2]:
import numpy as np
import pandas as pd

cars = pd.read_csv('cars.csv')
cars.head()

,Car,Model,Volume,Weight,CO2,Rating,Average Education
0,Toyoty,Aygo,1000.0,missing,99.276864,Good,bachelors
1,Mitsubishi,Space Star,1200.0,missing,95.565525,Excellent,bachelors
2,Skoda,Citigo,1000.0,929,95.308105,Good,masters
3,Fiat,500,900.0,865,90.111976,Bad,bachelors
4,Mini,Cooper,1500.0,1140,105.032618,Good,high school


In [3]:
cars.tail()

,Car,Model,Volume,Weight,CO2,Rating,Average Education
31,Volvo,XC70,2000.0,1746,117.55468,Bad,high school
32,Ford,B-Max,1600.0,1235,104.691499,Excellent,high school
33,BMW,216,1600.0,1390,108.420738,Bad,masters
34,Opel,Zafira,1600.0,1405,109.021072,Excellent,post doctorate
35,Mercedes,SLK,2500.0,1395,120.802845,Bad,bachelors


In [4]:
cars.iloc[6:10]

,Car,Model,Volume,Weight,CO2,Rating,Average Education
6,Skoda,Fabia,1400.0,1109,90.611629,Good,post doctorate
7,Mercedes,A-Class,1500.0,1365,92.854248,Good,high school
8,Ford,Fiesta,1500.0,1112,98.13637,Good,high school
9,Audi,A1,,1150,99.621678,Bad,high school


In [5]:
len(cars)

36

In [6]:
cars.dtypes

Car                   object
Model                 object
Volume               float64
Weight                object
CO2                  float64
Rating                object
Average Education     object
dtype: object

In [7]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36 entries, 0 to 35
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Car                36 non-null     object 
 1   Model              36 non-null     object 
 2   Volume             34 non-null     float64
 3   Weight             36 non-null     object 
 4   CO2                36 non-null     float64
 5   Rating             36 non-null     object 
 6   Average Education  36 non-null     object 
dtypes: float64(2), object(5)
memory usage: 2.1+ KB


`Weight` is stored as `object` (string) type instead of `float64`, which prevents us from analyzing it numerically. Let's find out why.

In [8]:
cars['Weight'].unique()

array(['missing', '929', '865', '1140', '1109', '1365', '1112', '1150',
       '980', '990', '1252', '1326', '1330', '1280', '1119', '1328',
       '1584', '1428', '1415', '1465', '1490', '1725', '1523', '1705',
       '1605', '1746', '1235', '1390', '1405', '1395'], dtype=object)

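The column uses the string `missing` as a placeholder, which forces every value to be stored as a string. Replace `missing` with `numpy.nan` so that numeric operations can skip it.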

In [9]:
cars['Weight'] = cars['Weight'].replace(['missing'], np.nan)
cars['Weight'].unique()

array([nan, '929', '865', '1140', '1109', '1365', '1112', '1150', '980',
       '990', '1252', '1326', '1330', '1280', '1119', '1328', '1584',
       '1428', '1415', '1465', '1490', '1725', '1523', '1705', '1605',
       '1746', '1235', '1390', '1405', '1395'], dtype=object)

Convert `Weight` to `float`.

In [10]:
cars['Weight'] = cars['Weight'].astype('float')
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36 entries, 0 to 35
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Car                36 non-null     object 
 1   Model              36 non-null     object 
 2   Volume             34 non-null     float64
 3   Weight             34 non-null     float64
 4   CO2                36 non-null     float64
 5   Rating             36 non-null     object 
 6   Average Education  36 non-null     object 
dtypes: float64(3), object(4)
memory usage: 2.1+ KB


That's better. Let's calculate the `mean` of `Weight`.

In [11]:
cars['Weight'].mean()

1310.9411764705883
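
As an alternative to the replace-then-cast steps above, `pd.to_numeric` with `errors='coerce'` turns any string that cannot be parsed as a number into `NaN` in a single call. The sketch below uses a small made-up Series (`weights` is a hypothetical stand-in, not the actual cars data).

```python
import pandas as pd

# Hypothetical stand-in for the raw Weight column: numbers stored as strings
weights = pd.Series(['missing', '929', '865', '1140'])

# errors='coerce' converts anything that cannot be parsed as a number to NaN
weights_numeric = pd.to_numeric(weights, errors='coerce')

print(weights_numeric.dtype)         # float64
print(weights_numeric.isna().sum())  # number of values that became NaN
print(weights_numeric.mean())        # mean() skips NaN by default
```

Coercion is convenient but silent: it also hides unexpected strings such as typos, so checking `unique()` first, as done above, is still worthwhile. If the placeholder is known in advance, passing `na_values=['missing']` to `pd.read_csv` should handle it at load time.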

## Categorical -> Ordinal

Ordinal categorical variables are groups that contain an inherent ranking, such as ratings of plays or responses to a survey question with a point scale (e.g., on a scale from 1-7, how happy are you right now?).

Convert `Rating` to an ordinal category.

In [12]:
cars['Rating'].unique()

array(['Good', 'Excellent', 'Bad'], dtype=object)

In [13]:
cars['Rating'] = pd.Categorical(
    cars['Rating'], categories=['Bad', 'Good', 'Excellent'], ordered=True)
cars.dtypes

Car                    object
Model                  object
Volume                float64
Weight                float64
CO2                   float64
Rating               category
Average Education      object
dtype: object

In [14]:
cars['Rating'].unique()

['Good', 'Excellent', 'Bad']
Categories (3, object): ['Bad' < 'Good' < 'Excellent']
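
Declaring the order is what makes the column ordinal: comparisons, `min`/`max`, and sorting then follow the ranking rather than alphabetical order. A minimal sketch, using a hypothetical `ratings` Series with the same category order as above:

```python
import pandas as pd

# Hypothetical ratings, using the same category order as the Rating column
ratings = pd.Series(pd.Categorical(
    ['Good', 'Excellent', 'Bad', 'Good'],
    categories=['Bad', 'Good', 'Excellent'], ordered=True))

# Comparisons respect the declared order, not alphabetical order
better_than_bad = ratings > 'Bad'
print(better_than_bad.tolist())

# min() and max() are defined because the categorical is ordered
print(ratings.min(), ratings.max())

# Sorting also follows the category order: Bad < Good < Excellent
print(ratings.sort_values().tolist())
```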

### Categorical encoding using label encoding

Variables stored as type `category` have an accessor attribute, `cat.codes`, that gives each value's integer code (its position in the category order). The Series attribute `values` returns the underlying categorical values, which display as the original strings.

In [18]:
cars['Rating_codes'] = cars['Rating'].cat.codes
cars['Rating_values'] = cars['Rating'].values
cars.head()

,Car,Model,Volume,Weight,CO2,Rating,Average Education,Rating_codes,Rating_values
0,Toyoty,Aygo,1000.0,,99.276864,Good,bachelors,1,Good
1,Mitsubishi,Space Star,1200.0,,95.565525,Excellent,bachelors,2,Excellent
2,Skoda,Citigo,1000.0,929.0,95.308105,Good,masters,1,Good
3,Fiat,500,900.0,865.0,90.111976,Bad,bachelors,0,Bad
4,Mini,Cooper,1500.0,1140.0,105.032618,Good,high school,1,Good
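
Two details of `cat.codes` are worth seeing on a small example: the codes follow the declared category order (so `Bad` is 0 here), and missing values receive the code `-1`. The sketch below uses a hypothetical `ratings` Series, not the cars data.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings with one missing value
ratings = pd.Series(pd.Categorical(
    ['Good', np.nan, 'Bad', 'Excellent'],
    categories=['Bad', 'Good', 'Excellent'], ordered=True))

# Integer codes follow the category order; NaN is encoded as -1
codes = ratings.cat.codes
print(codes.tolist())

# Lookup table from code back to label, built from cat.categories
code_to_label = dict(enumerate(ratings.cat.categories))
print(code_to_label)
```

Because `-1` is a sentinel and not a rank, it may be worth handling missing values before feeding the codes into a model.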


In [24]:
# Can also use value_counts() on a categorical column to get a table of frequencies
cars['Rating'].value_counts()

Good         15
Bad          12
Excellent     9
Name: Rating, dtype: int64
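
By default `value_counts()` sorts by frequency, as in the output above. For an ordered categorical it can be more readable to list the counts in category order, which `sort_index()` provides. A small sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical ratings, using the same category order as the Rating column
ratings = pd.Series(pd.Categorical(
    ['Good', 'Bad', 'Good', 'Excellent', 'Good', 'Bad'],
    categories=['Bad', 'Good', 'Excellent'], ordered=True))

# sort_index() orders the table by category rank instead of by count
counts = ratings.value_counts().sort_index()
print(counts)

# normalize=True reports proportions instead of raw counts
shares = ratings.value_counts(normalize=True).sort_index()
print(shares)
```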

## Categorical -> Nominal

Nominal categorical variables are made of categories without an inherent order. Examples of nominal variables are species of ants or people’s hair color.

### Categorical encoding using one-hot encoding

In [19]:
cars['Average Education'].unique()

array(['bachelors', 'masters', 'high school', 'post doctorate'],
      dtype=object)

In [20]:
cars = pd.get_dummies(data=cars, columns=['Average Education'])
cars.head()

,Car,Model,Volume,Weight,CO2,Rating,Rating_codes,Rating_values,Average Education_bachelors,Average Education_high school,Average Education_masters,Average Education_post doctorate
0,Toyoty,Aygo,1000.0,,99.276864,Good,1,Good,1,0,0,0
1,Mitsubishi,Space Star,1200.0,,95.565525,Excellent,2,Excellent,1,0,0,0
2,Skoda,Citigo,1000.0,929.0,95.308105,Good,1,Good,0,0,1,0
3,Fiat,500,900.0,865.0,90.111976,Bad,0,Bad,1,0,0,0
4,Mini,Cooper,1500.0,1140.0,105.032618,Good,1,Good,0,1,0,0
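
One-hot encoding creates one indicator column per category, with exactly one indicator set in each row. The sketch below shows this on a hypothetical `Education` column. It passes `dtype=int` explicitly because the default indicator dtype depends on the pandas version (`uint8` in the output above, `bool` in newer releases), and it shows `drop_first=True`, which drops one redundant column.

```python
import pandas as pd

# Hypothetical data with a single nominal column
df = pd.DataFrame(
    {'Education': ['bachelors', 'masters', 'high school', 'bachelors']})

# One indicator column per category; dtype=int gives 0/1 on any pandas version
encoded = pd.get_dummies(df, columns=['Education'], dtype=int)
print(encoded)

# drop_first=True removes the first category's column; a row of all zeros
# then stands for that dropped category
reduced = pd.get_dummies(df, columns=['Education'], dtype=int, drop_first=True)
print(reduced.columns.tolist())
```

Dropping one column is a common choice for linear models, where the full set of indicators is perfectly collinear with an intercept; for many other uses the full set is fine.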


## Final data types

In [21]:
cars.dtypes

Car                                   object
Model                                 object
Volume                               float64
Weight                               float64
CO2                                  float64
Rating                              category
Rating_codes                            int8
Rating_values                       category
Average Education_bachelors            uint8
Average Education_high school          uint8
Average Education_masters              uint8
Average Education_post doctorate       uint8
dtype: object

## Crosstab

A crosstab (cross-tabulation) counts how often each combination of values from two or more columns occurs in a dataframe.

In [20]:
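# Example: 20 random yes/no answers, split into two groups of 10.
# No seed is set, so the counts change on every run.
yesses_and_nos = np.random.choice(['yes', 'no'], size=20)
labels = ['flipped_head'] * 10 + ['did_not_flip_head'] * 10
data = pd.DataFrame({'Group': labels, 'Head': yesses_and_nos})
pd.crosstab(data['Group'], data['Head'])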

Head,no,yes
Group,,
did_not_flip_head,6,4
flipped_head,5,5
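
Two `pd.crosstab` options are often useful alongside the raw counts: `margins=True` adds row and column totals, and `normalize='index'` converts each row to proportions. The sketch below rebuilds the example with a seeded generator so the result is reproducible; the seed value is arbitrary and not part of the original example.

```python
import numpy as np
import pandas as pd

# Seeded generator for reproducibility (the seed value is an arbitrary choice)
rng = np.random.default_rng(0)
answers = rng.choice(['yes', 'no'], size=20)
labels = ['flipped_head'] * 10 + ['did_not_flip_head'] * 10
data = pd.DataFrame({'Group': labels, 'Head': answers})

# margins=True appends an 'All' row and column holding the totals
counts = pd.crosstab(data['Group'], data['Head'], margins=True)
print(counts)

# normalize='index' divides each row by its total, giving row proportions
proportions = pd.crosstab(data['Group'], data['Head'], normalize='index')
print(proportions)
```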
