# Pandas for Data Analysis

## Agenda

 - Lab 1: Missing values
 - Lab 2: Working with datatypes
 - Lab 3: Summary statistics

## Data I/O

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# read in the car dataset
df = pd.read_csv('resources/imports-85.csv')

## Lab 1: Missing Values

 - Missing values are encoded as '?'. Replace every instance of '?' with a null value, using the `replace` method available on Pandas DataFrames
 - Find which columns have missing values
 - Print the shape of the original dataframe, drop all rows with missing values, then print the shape of the resulting dataframe (*do not make changes inplace*)
 - Print the mean of the price column, impute any missing values with the median of that column, and print the resulting mean (*do not make changes inplace*)
     - You may hit an error at first. Make sure that the datatype of the price column is correct

Use the `replace` method to replace every instance of '?' with `np.nan`

In [2]:
df.replace('?',np.nan,inplace=True)
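As an alternative to replacing after loading, the placeholder can be declared as missing at read time through the `na_values` argument of `pd.read_csv`. The sketch below uses a hypothetical three-row CSV held in memory rather than the lab's file, so the column values are illustrative only.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the car dataset (not the real file)
csv_text = "make,price,horsepower\naudi,13950,102\nbmw,?,101\nsaab,15040,?\n"

# na_values tells the parser to treat '?' as a missing value while reading
sample = pd.read_csv(io.StringIO(csv_text), na_values="?")

print(sample.isna().sum())   # missing values per column
print(sample.dtypes)         # numeric columns are parsed as numbers directly
```

A side effect worth noting is that columns such as price are then read as numeric from the start, so the later `astype(float)` step would likely not be needed.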

Find columns with missing values. (We know the answer since we inserted the `NaN` to create the synthetic data.)

In [3]:
print(df.isna().any(axis=0))

symboling            False
normalized-losses     True
make                 False
fuel-type            False
aspiration           False
num-of-doors          True
body-style           False
drive-wheels         False
engine-location      False
wheel-base           False
length               False
width                False
height               False
curb-weight          False
engine-type          False
num-of-cylinders     False
engine-size          False
fuel-system          False
bore                  True
stroke                True
compression-ratio    False
horsepower            True
peak-rpm              True
city-mpg             False
highway-mpg          False
price                 True
dtype: bool
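The boolean view above says which columns have gaps but not how many. A minimal sketch of counting them, using a small made-up frame (the values are assumptions, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few gaps
sample = pd.DataFrame({
    "make": ["audi", "bmw", "saab"],
    "price": [13950.0, np.nan, 15040.0],
    "bore": [np.nan, np.nan, 3.5],
})

# Count missing values per column, keep only columns that have any,
# and list the worst offenders first
missing_counts = sample.isna().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_counts)
```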


Original Shape

Compare shapes. Dropping rows with missing values leaves 159 of the 205 rows, i.e. 46 fewer rows.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
symboling            205 non-null int64
normalized-losses    164 non-null object
make                 205 non-null object
fuel-type            205 non-null object
aspiration           205 non-null object
num-of-doors         203 non-null object
body-style           205 non-null object
drive-wheels         205 non-null object
engine-location      205 non-null object
wheel-base           205 non-null float64
length               205 non-null float64
width                205 non-null float64
height               205 non-null float64
curb-weight          205 non-null int64
engine-type          205 non-null object
num-of-cylinders     205 non-null object
engine-size          205 non-null int64
fuel-system          205 non-null object
bore                 201 non-null object
stroke               201 non-null object
compression-ratio    205 non-null float64
horsepower           203 non-nul

In [5]:
print('original shape: ',df.shape)
print('after dropping rows with missing values: ',df.dropna().shape)

original shape:  (205, 26)
after dropping rows with missing values:  (159, 26)
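Because `dropna` returns a new dataframe unless `inplace=True` is passed, the original is left untouched. The sketch below, on a hypothetical four-row frame, also shows the `subset` argument, which restricts the check to chosen columns and usually discards far fewer rows.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; values are illustrative only
sample = pd.DataFrame({
    "make": ["audi", "bmw", "saab", "volvo"],
    "price": [13950.0, np.nan, 15040.0, 12940.0],
    "normalized_losses": [164.0, 192.0, np.nan, np.nan],
})

dropped_any = sample.dropna()                    # drop rows with any missing value
dropped_price = sample.dropna(subset=["price"])  # drop only rows missing a price

print(sample.shape, dropped_any.shape, dropped_price.shape)
```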


Original Mean
* First we will convert the price column to a float

In [6]:
df['price'] = df['price'].astype(float)
df['price'].mean()

13207.129353233831

Mean after replacing null values with Median

In [7]:
med_val = df['price'].median()
df['price'].fillna(med_val).mean()

13150.307317073171
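The mean falls after imputation because the median (10295.0) sits below the mean in this right-skewed column, so each filled value pulls the average down. The arithmetic can be checked by hand from numbers reported elsewhere in this notebook (205 rows, 201 non-null prices); the variable names are just for illustration.

```python
# Figures taken from the outputs in this notebook
n_rows = 205                      # total rows
n_present = 201                   # non-null prices
mean_before = 13207.129353233831  # mean of the non-null prices
median_price = 10295.0            # 50% row of describe()

# Total of the known prices, plus one median per missing row
total_present = mean_before * n_present
n_missing = n_rows - n_present
mean_after = (total_present + n_missing * median_price) / n_rows

print(round(mean_after, 2))  # 13150.31
```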

## Lab 2: Working with datatypes

 - Determine the variable names and variable types for this dataset
 - In Python it is useful to format concatenated words with an underscore ('_') rather than a dash ('-'). Replace the dashes in all column names with underscores
 - Similar to price, you will see that the following columns need to be converted to type float:
     - bore
     - stroke
     - horsepower
     - peak-rpm
 - Find the mode value for all categorical columns
 - The values for body_style should be the following: hardtop, wagon, sedan, hatchback, convertible
     - Check the unique values for that column and apply any adjustments required to ensure that there is a unique identifier for each category
     - Create dummy values from the body_style column and create new dataframe called `df_dummy` with these dummy variables
 - Display skew for all numerical values sorted from largest to smallest absolute value


In [8]:
# Student solution

In [9]:
# Determine the variable names and variable types for this dataset
print(f'\nSchema: \n{(df.dtypes)}')


Schema: 
symboling              int64
normalized-losses     object
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                  object
stroke                object
compression-ratio    float64
horsepower            object
peak-rpm              object
city-mpg               int64
highway-mpg            int64
price                float64
dtype: object


#### 3 options discussed in class for replacing dashes with underscores in column names

In [10]:
#1 create a blank list, append adjusted values to list and then replace columns using new list

new_list = []
for i in df.columns:
    new_list.append(i.replace('-','_'))
    
df.columns = new_list

In [11]:
# Use a list comprehension to achieve same result as above
df.columns = [i.replace('-','_') for i in df.columns]

In [12]:
# Use the str.replace to replace values in columns
df.columns = df.columns.str.replace("-","_")
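A fourth possibility, not one of the options discussed in class, is `DataFrame.rename` with a callable. It returns a new dataframe and leaves the original column names alone, which can be convenient when changes should not be made in place. The frame below is a hypothetical stand-in.

```python
import pandas as pd

# Hypothetical empty frame that only carries column names
sample = pd.DataFrame(columns=["fuel-type", "num-of-doors", "price"])

# rename applies the function to every column label and returns a copy
renamed = sample.rename(columns=lambda name: name.replace("-", "_"))

print(list(renamed.columns))
print(list(sample.columns))   # original labels are unchanged
```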

#### 3 options for changing datatype to type float

In [13]:
# Create a list of columns to replace, run for loop to replace those columns
objects = ['bore','stroke','horsepower','peak_rpm']
for i in objects:
    df[i] = df[i].astype('float')

In [14]:
# Run through all object columns and try to replace them with float objects
# Useful if you do not already have a list of columns you need to convert
for i in df.select_dtypes('object').columns:
    try:
        df[i] = df[i].astype('float')
    except Exception as e:
        print(f"col_name: {i}\nerror: {e}")

col_name: make
error: could not convert string to float: 'alfa-romero'
col_name: fuel_type
error: could not convert string to float: 'gas'
col_name: aspiration
error: could not convert string to float: 'std'
col_name: num_of_doors
error: could not convert string to float: 'two'
col_name: body_style
error: could not convert string to float: 'convertible'
col_name: drive_wheels
error: could not convert string to float: 'rwd'
col_name: engine_location
error: could not convert string to float: 'front'
col_name: engine_type
error: could not convert string to float: 'dohc'
col_name: num_of_cylinders
error: could not convert string to float: 'four'
col_name: fuel_system
error: could not convert string to float: 'mpfi'


In [15]:
# Most efficient, simply select columns for adjustment and set them as type float
df[objects] = df[objects].astype('float')
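If a column might still contain stray non-numeric strings, `astype('float')` raises an error. One hedged alternative is `pd.to_numeric` with `errors='coerce'`, which turns anything unparseable into `NaN`. The sample below is made up and keeps a '?' on purpose to show the effect.

```python
import pandas as pd

# Hypothetical frame where one placeholder slipped through
sample = pd.DataFrame({
    "bore": ["3.47", "2.68", "?"],
    "make": ["audi", "bmw", "saab"],
})

# Unparseable entries become NaN instead of raising an error
sample["bore"] = pd.to_numeric(sample["bore"], errors="coerce")

print(sample["bore"])
```

The trade-off is that coercion is silent, so it is worth comparing the missing-value count before and after to see how many entries were discarded.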

In [16]:
# Find the mode value for all categorical columns
print("Mode for categorical variables")
display(df.select_dtypes('object').mode().T)

Mode for categorical variables


| | 0 |
|---|---|
| make | toyota |
| fuel_type | gas |
| aspiration | std |
| num_of_doors | four |
| body_style | sedan |
| drive_wheels | fwd |
| engine_location | front |
| engine_type | ohc |
| num_of_cylinders | four |
| fuel_system | mpfi |


In [17]:
# Check the unique values for that column and apply any adjustments 
# required to ensure that there is a unique identifier for each category
print('originally: ',df['body_style'].unique())
df['body_style'] = df['body_style'].str.lower()
df['body_style'] = df['body_style'].str.strip()
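# Assign the result back; calling replace(..., inplace=True) on a selected
# column may not update df in newer pandas versions
df['body_style'] = df['body_style'].replace('hachback','hatchback')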
print('after adjustments: ',df['body_style'].unique())

originally:  ['convertible' 'sedan' 'wagon' 'hatchback' 'hardtop' 'Wagon' 'hachback'
 '  Wagon']
after adjustments:  ['convertible' 'sedan' 'wagon' 'hatchback' 'hardtop']
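A quick way to confirm that the cleanup covered everything is to compare the cleaned values against the expected category list from the lab description. This is a sketch on a hypothetical Series that reproduces the kinds of inconsistencies seen above.

```python
import pandas as pd

# Categories listed in the lab description
expected = {"hardtop", "wagon", "sedan", "hatchback", "convertible"}

# Hypothetical raw values with casing, whitespace and spelling problems
raw = pd.Series(["sedan", "Wagon", "  Wagon", "hachback", "hardtop", "convertible"])

# Normalise whitespace and case, then fix the known misspelling
cleaned = raw.str.strip().str.lower().replace({"hachback": "hatchback"})

# Anything left over is a value the cleanup did not handle
unexpected = set(cleaned.unique()) - expected
print(unexpected)
```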


In [18]:
# Create dummy values from the body_style column and create new dataframe df_dummy
df_dummy = pd.get_dummies(df,columns = ['body_style'])

df_dummy.head()

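| | symboling | normalized_losses | make | fuel_type | aspiration | num_of_doors | drive_wheels | engine_location | wheel_base | length | ... | horsepower | peak_rpm | city_mpg | highway_mpg | price | body_style_convertible | body_style_hardtop | body_style_hatchback | body_style_sedan | body_style_wagon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | NaN | alfa-romero | gas | std | two | rwd | front | 88.6 | 168.8 | ... | 111.0 | 5000.0 | 21 | 27 | 13495.0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 2 | 164.0 | audi | gas | std | four | fwd | front | 99.8 | 176.6 | ... | 102.0 | 5500.0 | 24 | 30 | 13950.0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 2 | 164.0 | audi | gas | std | four | 4wd | front | 99.4 | 176.6 | ... | 115.0 | 5500.0 | 18 | 22 | 17450.0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 2 | NaN | audi | gas | std | two | fwd | front | 99.8 | 177.3 | ... | 110.0 | 5500.0 | 19 | 25 | 15250.0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 1 | 158.0 | audi | gas | std | four | fwd | front | 105.8 | 192.7 | ... | 110.0 | 5500.0 | 19 | 25 | 17710.0 | 0 | 0 | 0 | 1 | 0 |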
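Depending on the pandas version, `get_dummies` may return boolean columns rather than the 0/1 integers shown above. A minimal sketch, on a hypothetical three-row frame, that requests integers explicitly with `dtype=int` and builds the dummy columns separately before joining them back:

```python
import pandas as pd

# Hypothetical frame; values are illustrative only
sample = pd.DataFrame({
    "make": ["audi", "bmw", "saab"],
    "body_style": ["sedan", "wagon", "sedan"],
})

# One indicator column per category, stored as 0/1 integers
dummies = pd.get_dummies(sample["body_style"], prefix="body_style", dtype=int)

# Swap the original column for its indicator columns
sample_dummy = pd.concat([sample.drop(columns="body_style"), dummies], axis=1)
print(sample_dummy)
```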


In [19]:
# Display skew for all numerical values sorted from largest to smallest absolute value
df.skew(numeric_only=True).abs().sort_values(ascending=False)

compression_ratio    2.610862
engine_size          1.947655
price                1.809675
horsepower           1.391029
wheel_base           1.050214
width                0.904003
normalized_losses    0.765976
stroke               0.683122
curb_weight          0.681398
city_mpg             0.663704
highway_mpg          0.539997
symboling            0.211072
length               0.155954
peak_rpm             0.073237
height               0.063123
bore                 0.020016
dtype: float64
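Taking the absolute value discards the direction of the skew. If the sign is of interest, one option is to sort by magnitude with the `key` argument of `sort_values` while keeping the signed values. The columns below are hypothetical and chosen so that one is right-tailed, one left-tailed and one symmetric.

```python
import pandas as pd

# Hypothetical data with different tail directions
sample = pd.DataFrame({
    "right_tailed": [1, 1, 2, 2, 3, 20],
    "left_tailed": [-10, -3, -2, -2, -1, -1],
    "symmetric": [1, 2, 3, 4, 5, 6],
})

# Sort by absolute skew, largest first, but keep the sign in the result
skews = sample.skew().sort_values(key=abs, ascending=False)
print(skews)
```

A positive value indicates a longer right tail and a negative value a longer left tail.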

## Lab 3: Summary Statistics

 - Use the `describe` method to get summary statistics for numerical and object values in two separate dataframes
 - Find the feature with the strongest correlation statistic with price (use the absolute value)
     - Hint: `idxmax` is a useful pandas method for getting the index of the max value in a series or dataframe. [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.idxmax.html) 

     
 
### Optional
 - Amongst the numerical values, find the feature with which each column is most correlated
     - Hints:
         - Once you have your correlation matrix, it will help to find the max value if you replace the diagonal (correlations of features with themselves) with 0s (a for loop and `iloc` will help you here)
         - Once the diagonals are zeroed out, use `idxmax` again

In [20]:
# Student solution

In [21]:
# Use describe method to get summary statistics for numerical and object values in two separate dataframes
print('numerical summary stats:')
display(df.describe())

print('\n\nobject summary stats:')
display(df.describe(include = ['object']))

numerical summary stats:


| | symboling | normalized_losses | wheel_base | length | width | height | curb_weight | engine_size | bore | stroke | compression_ratio | horsepower | peak_rpm | city_mpg | highway_mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 205.0 | 164.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 201.0 | 201.0 | 205.0 | 203.0 | 203.0 | 205.0 | 205.0 | 201.0 |
| mean | 0.834146 | 122.0 | 98.756585 | 174.049268 | 65.907805 | 53.724878 | 2555.565854 | 126.907317 | 3.329751 | 3.255423 | 10.142537 | 104.256158 | 5125.369458 | 25.219512 | 30.75122 | 13207.129353 |
| std | 1.245307 | 35.442168 | 6.021776 | 12.337289 | 2.145204 | 2.443522 | 520.680204 | 41.642693 | 0.273539 | 0.316717 | 3.97204 | 39.714369 | 479.33456 | 6.542142 | 6.886443 | 7947.066342 |
| min | -2.0 | 65.0 | 86.6 | 141.1 | 60.3 | 47.8 | 1488.0 | 61.0 | 2.54 | 2.07 | 7.0 | 48.0 | 4150.0 | 13.0 | 16.0 | 5118.0 |
| 25% | 0.0 | 94.0 | 94.5 | 166.3 | 64.1 | 52.0 | 2145.0 | 97.0 | 3.15 | 3.11 | 8.6 | 70.0 | 4800.0 | 19.0 | 25.0 | 7775.0 |
| 50% | 1.0 | 115.0 | 97.0 | 173.2 | 65.5 | 54.1 | 2414.0 | 120.0 | 3.31 | 3.29 | 9.0 | 95.0 | 5200.0 | 24.0 | 30.0 | 10295.0 |
| 75% | 2.0 | 150.0 | 102.4 | 183.1 | 66.9 | 55.5 | 2935.0 | 141.0 | 3.59 | 3.41 | 9.4 | 116.0 | 5500.0 | 30.0 | 34.0 | 16500.0 |
| max | 3.0 | 256.0 | 120.9 | 208.1 | 72.3 | 59.8 | 4066.0 | 326.0 | 3.94 | 4.17 | 23.0 | 288.0 | 6600.0 | 49.0 | 54.0 | 45400.0 |




object summary stats:


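| | make | fuel_type | aspiration | num_of_doors | body_style | drive_wheels | engine_location | engine_type | num_of_cylinders | fuel_system |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 205 | 205 | 205 | 203 | 205 | 205 | 205 | 205 | 205 | 205 |
| unique | 22 | 2 | 2 | 2 | 5 | 3 | 2 | 7 | 7 | 8 |
| top | toyota | gas | std | four | sedan | fwd | front | ohc | four | mpfi |
| freq | 32 | 185 | 168 | 114 | 96 | 120 | 202 | 148 | 159 | 94 |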


In [22]:
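# Restrict to numeric columns first; newer pandas versions raise on object columns
df.select_dtypes('number').drop(columns='price').corrwith(df['price']).abs().sort_values()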

compression_ratio    0.071107
stroke               0.082310
symboling            0.082391
peak_rpm             0.101649
height               0.135486
normalized_losses    0.203254
bore                 0.543436
wheel_base           0.584642
city_mpg             0.686571
length               0.690628
highway_mpg          0.704692
width                0.751265
horsepower           0.810533
curb_weight          0.834415
engine_size          0.872335
dtype: float64

In [23]:
# Find feature with strongest correlation statistic with price (Use absolute value)
price_corr = df.select_dtypes('number').drop(columns='price').corrwith(df['price']).abs()
print(f'The feature with the strongest correlation with price is "{price_corr.idxmax()}" with a correlation coefficient of: {round(price_corr.max(),2)}')


The feature with the strongest correlation with price is "engine_size" with a correlation coefficient of: 0.87
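Taking the absolute value before calling `idxmax` matters because the strongest relationship may be a negative one. The sketch below uses made-up numbers (not the car dataset) in which a hypothetical `highway_mpg` column falls exactly as `price` rises.

```python
import pandas as pd

# Hypothetical data: highway_mpg is an exact decreasing function of price
sample = pd.DataFrame({
    "price": [10.0, 12.0, 15.0, 20.0, 26.0],
    "engine_size": [90, 100, 120, 150, 200],
    "highway_mpg": [40, 38, 35, 30, 24],
    "height": [54, 50, 55, 51, 53],
})

corr = sample.drop(columns="price").corrwith(sample["price"])

largest_signed = corr.idxmax()   # largest positive correlation only
strongest = corr.abs().idxmax()  # strongest relationship in either direction

print(largest_signed, strongest)
```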


In [26]:
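# Amongst the numerical values, find the feature with which each column is most correlated
# Note: idxmax below picks the largest signed correlation, not the largest absolute one
full_corr = df.select_dtypes('number').corr()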

for i in range(len(full_corr)):
    full_corr.iloc[i,i] = 0

print('\n\nThe strongest correlated features are:')
full_corr.idxmax()




The strongest correlated features are:


symboling            normalized_losses
normalized_losses            symboling
wheel_base                      length
length                     curb_weight
width                      curb_weight
height                      wheel_base
curb_weight                     length
engine_size                      price
bore                       curb_weight
stroke                     engine_size
compression_ratio             city_mpg
horsepower                 engine_size
peak_rpm                     symboling
city_mpg                   highway_mpg
highway_mpg                   city_mpg
price                      engine_size
dtype: object
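The loop over the diagonal can also be written without explicit indexing. One possible sketch, on hypothetical data, masks the diagonal with a boolean identity matrix and ranks by absolute correlation, so that strong negative relationships are also considered. This is an alternative formulation, not the solution used above, and it may pick different partners than the signed version.

```python
import numpy as np
import pandas as pd

# Hypothetical data; values are illustrative only
sample = pd.DataFrame({
    "price": [10.0, 12.0, 15.0, 20.0, 26.0],
    "engine_size": [90, 100, 120, 150, 200],
    "highway_mpg": [40, 38, 33, 30, 22],
    "height": [54, 50, 55, 51, 53],
})

corr = sample.corr()

# Replace self-correlations on the diagonal with NaN so idxmax skips them
off_diag = corr.mask(np.eye(len(corr), dtype=bool))

# For each column, the other column with the largest absolute correlation
strongest_partner = off_diag.abs().idxmax()
print(strongest_partner)
```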