# Machine learning zoomcamp

* https://github.com/alexeygrigorev/mlbookcamp-code/tree/master/course-zoomcamp

## Pandas

* Video: https://www.youtube.com/watch?v=0j3XK5PsnxA
* Notebook from Video: https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/01-intro/notebooks/09-pandas.ipynb
* Notebook for exercise: https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/appendix-d-pandas.ipynb

## My Exercise and Notes

Pandas is a Python library for data analysis and manipulation. It works with data in tabular format.

In [1]:
import numpy as np
import pandas as pd

Sample data for analysis and manipulation, defined as a list of lists.

In [4]:
data = [
    ['Nissan', 'Stanza', 1991, 138, 4, 'MANUAL', 'sedan', 2000],
    ['Hyundai', 'Sonata', 2017, None, 4, 'AUTOMATIC', 'Sedan', 27150],
    ['Lotus', 'Elise', 2010, 218, 4, 'MANUAL', 'convertible', 54990],
    ['GMC', 'Acadia',  2017, 194, 4, 'AUTOMATIC', '4dr SUV', 34450],
    ['Nissan', 'Frontier', 2017, 261, 6, 'MANUAL', 'Pickup', 32340],
]

## DataFrames

Load the data into a dataframe. A Pandas dataframe is a table.

In [5]:
pd.DataFrame(data)

Unnamed: 0,0,1,2,3,4,5,6,7
0,Nissan,Stanza,1991,138.0,4,MANUAL,sedan,2000
1,Hyundai,Sonata,2017,,4,AUTOMATIC,Sedan,27150
2,Lotus,Elise,2010,218.0,4,MANUAL,convertible,54990
3,GMC,Acadia,2017,194.0,4,AUTOMATIC,4dr SUV,34450
4,Nissan,Frontier,2017,261.0,6,MANUAL,Pickup,32340


Let us give the columns some meaningful names

In [8]:
column_names = [
    'Make', 'Model', 'Year', 'Engine HP', 'Engine Cylinders',
    'Transmission Type', 'Vehicle_Style', 'MSRP'
]

In [9]:
pd.DataFrame(data, columns=column_names)

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
0,Nissan,Stanza,1991,138.0,4,MANUAL,sedan,2000
1,Hyundai,Sonata,2017,,4,AUTOMATIC,Sedan,27150
2,Lotus,Elise,2010,218.0,4,MANUAL,convertible,54990
3,GMC,Acadia,2017,194.0,4,AUTOMATIC,4dr SUV,34450
4,Nissan,Frontier,2017,261.0,6,MANUAL,Pickup,32340


In [10]:
df = pd.DataFrame(data, columns=column_names)
df

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
0,Nissan,Stanza,1991,138.0,4,MANUAL,sedan,2000
1,Hyundai,Sonata,2017,,4,AUTOMATIC,Sedan,27150
2,Lotus,Elise,2010,218.0,4,MANUAL,convertible,54990
3,GMC,Acadia,2017,194.0,4,AUTOMATIC,4dr SUV,34450
4,Nissan,Frontier,2017,261.0,6,MANUAL,Pickup,32340


Alternatively, we can use a list of dictionaries to create a dataframe. Here we have a dictionary for every row of the data with column name as the key and the column value as the value of the dictionary. When using such a format, there is no need to separately define columns.

In [12]:
data = [
    {
        "Make": "Nissan",
        "Model": "Stanza",
        "Year": 1991,
        "Engine HP": 138.0,
        "Engine Cylinders": 4,
        "Transmission Type": "MANUAL",
        "Vehicle_Style": "sedan",
        "MSRP": 2000
    },
    {
        "Make": "Hyundai",
        "Model": "Sonata",
        "Year": 2017,
        "Engine HP": None,
        "Engine Cylinders": 4,
        "Transmission Type": "AUTOMATIC",
        "Vehicle_Style": "Sedan",
        "MSRP": 27150
    },
    {
        "Make": "Lotus",
        "Model": "Elise",
        "Year": 2010,
        "Engine HP": 218.0,
        "Engine Cylinders": 4,
        "Transmission Type": "MANUAL",
        "Vehicle_Style": "convertible",
        "MSRP": 54990
    },
    {
        "Make": "GMC",
        "Model": "Acadia",
        "Year": 2017,
        "Engine HP": 194.0,
        "Engine Cylinders": 4,
        "Transmission Type": "AUTOMATIC",
        "Vehicle_Style": "4dr SUV",
        "MSRP": 34450
    },
    {
        "Make": "Nissan",
        "Model": "Frontier",
        "Year": 2017,
        "Engine HP": 261.0,
        "Engine Cylinders": 6,
        "Transmission Type": "MANUAL",
        "Vehicle_Style": "Pickup",
        "MSRP": 32340
    }
]

In [14]:
df = pd.DataFrame(data)
df

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
0,Nissan,Stanza,1991,138.0,4,MANUAL,sedan,2000
1,Hyundai,Sonata,2017,,4,AUTOMATIC,Sedan,27150
2,Lotus,Elise,2010,218.0,4,MANUAL,convertible,54990
3,GMC,Acadia,2017,194.0,4,AUTOMATIC,4dr SUV,34450
4,Nissan,Frontier,2017,261.0,6,MANUAL,Pickup,32340


Display the first few entries of the dataframe

In [18]:
df.head()

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
0,Nissan,Stanza,1991,138.0,4,MANUAL,sedan,2000
1,Hyundai,Sonata,2017,,4,AUTOMATIC,Sedan,27150
2,Lotus,Elise,2010,218.0,4,MANUAL,convertible,54990
3,GMC,Acadia,2017,194.0,4,AUTOMATIC,4dr SUV,34450
4,Nissan,Frontier,2017,261.0,6,MANUAL,Pickup,32340


Display first *n* entries of the dataframe

In [19]:
df.head(2)  # or df.head(n=2)

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
0,Nissan,Stanza,1991,138.0,4,MANUAL,sedan,2000
1,Hyundai,Sonata,2017,,4,AUTOMATIC,Sedan,27150


## Series

A table consists of columns. In Pandas, a dataframe is a table and each column of the table is a series. Thus a dataframe may consist of multiple series.

To access a column, use dot notation or brackets. When a column name contains spaces, use the bracket notation, since dot notation does not work for such names.

In [22]:
df.Make

0     Nissan
1    Hyundai
2      Lotus
3        GMC
4     Nissan
Name: Make, dtype: object

In [23]:
df['Make']

0     Nissan
1    Hyundai
2      Lotus
3        GMC
4     Nissan
Name: Make, dtype: object

In [24]:
df.Engine HP

SyntaxError: invalid syntax (1897567212.py, line 1)

In [25]:
df['Engine HP']

0    138.0
1      NaN
2    218.0
3    194.0
4    261.0
Name: Engine HP, dtype: float64

In [26]:
col_name = 'Engine HP'
df[col_name]

0    138.0
1      NaN
2    218.0
3    194.0
4    261.0
Name: Engine HP, dtype: float64

Use a list to select a subset of the columns

In [27]:
columns = ['Make', 'Model', 'Year', 'MSRP']
df[columns]

Unnamed: 0,Make,Model,Year,MSRP
0,Nissan,Stanza,1991,2000
1,Hyundai,Sonata,2017,27150
2,Lotus,Elise,2010,54990
3,GMC,Acadia,2017,34450
4,Nissan,Frontier,2017,32340


Add more columns to dataframe

In [28]:
df['id'] = [1, 2, 3, 4, 5]
df

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP,id
0,Nissan,Stanza,1991,138.0,4,MANUAL,sedan,2000,1
1,Hyundai,Sonata,2017,,4,AUTOMATIC,Sedan,27150,2
2,Lotus,Elise,2010,218.0,4,MANUAL,convertible,54990,3
3,GMC,Acadia,2017,194.0,4,AUTOMATIC,4dr SUV,34450,4
4,Nissan,Frontier,2017,261.0,6,MANUAL,Pickup,32340,5


Delete a column

In [29]:
del df['id']
df

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
0,Nissan,Stanza,1991,138.0,4,MANUAL,sedan,2000
1,Hyundai,Sonata,2017,,4,AUTOMATIC,Sedan,27150
2,Lotus,Elise,2010,218.0,4,MANUAL,convertible,54990
3,GMC,Acadia,2017,194.0,4,AUTOMATIC,4dr SUV,34450
4,Nissan,Frontier,2017,261.0,6,MANUAL,Pickup,32340


## Index

The index holds the labels of the rows of a dataframe.

In [30]:
df.index

RangeIndex(start=0, stop=5, step=1)

We can assign different labels to the index (instead of the default 0, 1, 2 ...)

In [36]:
df.index = ['a', 'b', 'c', 'd', 'e']
df

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
a,Nissan,Stanza,1991,138.0,4,MANUAL,sedan,2000
b,Hyundai,Sonata,2017,,4,AUTOMATIC,Sedan,27150
c,Lotus,Elise,2010,218.0,4,MANUAL,convertible,54990
d,GMC,Acadia,2017,194.0,4,AUTOMATIC,4dr SUV,34450
e,Nissan,Frontier,2017,261.0,6,MANUAL,Pickup,32340


## Accessing elements

To get the data in a particular row, access it using the index label and df.loc

In [38]:
df.loc['a']

Make                 Nissan
Model                Stanza
Year                   1991
Engine HP             138.0
Engine Cylinders          4
Transmission Type    MANUAL
Vehicle_Style         sedan
MSRP                   2000
Name: a, dtype: object

You can also access the data using the positional index and df.iloc

In [39]:
df.iloc[0]

Make                 Nissan
Model                Stanza
Year                   1991
Engine HP             138.0
Engine Cylinders          4
Transmission Type    MANUAL
Vehicle_Style         sedan
MSRP                   2000
Name: a, dtype: object

Access multiple rows

In [40]:
df.loc[['b', 'c']]

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
b,Hyundai,Sonata,2017,,4,AUTOMATIC,Sedan,27150
c,Lotus,Elise,2010,218.0,4,MANUAL,convertible,54990


In [41]:
df.iloc[[1,2]]

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
b,Hyundai,Sonata,2017,,4,AUTOMATIC,Sedan,27150
c,Lotus,Elise,2010,218.0,4,MANUAL,convertible,54990
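
The difference between label-based and position-based access is easier to see when both a row and a column are selected. Below is a minimal sketch on a small stand-in frame (the `cars` name and its three rows are assumptions made for illustration, not part of the notebook). It also shows that label slices with `loc` include the end label, while position slices with `iloc` exclude the end position.

```python
import pandas as pd

# Small hypothetical frame with string labels as the index
cars = pd.DataFrame(
    {"Make": ["Nissan", "Hyundai", "Lotus"], "MSRP": [2000, 27150, 54990]},
    index=["a", "b", "c"],
)

# Label-based: row label first, then column name
price_by_label = cars.loc["b", "MSRP"]

# Position-based: row position first, then column position
price_by_position = cars.iloc[1, 1]

# Label slices include the end label; position slices exclude the end position
rows_by_label = cars.loc["a":"b"]    # rows a and b
rows_by_position = cars.iloc[0:1]    # row a only

print(price_by_label, price_by_position)
print(len(rows_by_label), len(rows_by_position))
```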


To reset the index to the default numerical 0, 1, ... use df.reset_index. This adds a new numerical index and keeps the existing index as a new column.

In [46]:
df.reset_index()

Unnamed: 0,index,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
0,a,Nissan,Stanza,1991,138.0,4,MANUAL,sedan,2000
1,b,Hyundai,Sonata,2017,,4,AUTOMATIC,Sedan,27150
2,c,Lotus,Elise,2010,218.0,4,MANUAL,convertible,54990
3,d,GMC,Acadia,2017,194.0,4,AUTOMATIC,4dr SUV,34450
4,e,Nissan,Frontier,2017,261.0,6,MANUAL,Pickup,32340


If you do not want the previous index saved as a new column, use drop=True

In [48]:
df.reset_index(drop=True)

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
0,Nissan,Stanza,1991,138.0,4,MANUAL,sedan,2000
1,Hyundai,Sonata,2017,,4,AUTOMATIC,Sedan,27150
2,Lotus,Elise,2010,218.0,4,MANUAL,convertible,54990
3,GMC,Acadia,2017,194.0,4,AUTOMATIC,4dr SUV,34450
4,Nissan,Frontier,2017,261.0,6,MANUAL,Pickup,32340


In [49]:
df

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
a,Nissan,Stanza,1991,138.0,4,MANUAL,sedan,2000
b,Hyundai,Sonata,2017,,4,AUTOMATIC,Sedan,27150
c,Lotus,Elise,2010,218.0,4,MANUAL,convertible,54990
d,GMC,Acadia,2017,194.0,4,AUTOMATIC,4dr SUV,34450
e,Nissan,Frontier,2017,261.0,6,MANUAL,Pickup,32340


Note that above, the original dataframe is still the same. To keep the new dataframe, assign it back to the original variable

In [50]:
df = df.reset_index(drop=True)
df

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
0,Nissan,Stanza,1991,138.0,4,MANUAL,sedan,2000
1,Hyundai,Sonata,2017,,4,AUTOMATIC,Sedan,27150
2,Lotus,Elise,2010,218.0,4,MANUAL,convertible,54990
3,GMC,Acadia,2017,194.0,4,AUTOMATIC,4dr SUV,34450
4,Nissan,Frontier,2017,261.0,6,MANUAL,Pickup,32340


## Element wise operations

Under the hood, pandas uses NumPy for its operations

In [54]:
df['MSRP']

0     2000
1    27150
2    54990
3    34450
4    32340
Name: MSRP, dtype: int64

Perform operations on a single series

In [55]:
df['MSRP'] / 1000

0     2.00
1    27.15
2    54.99
3    34.45
4    32.34
Name: MSRP, dtype: float64

Or perform them on multiple series

In [65]:
df[['Engine HP', 'MSRP']] / 100

Unnamed: 0,Engine HP,MSRP
0,1.38,20.0
1,,271.5
2,2.18,549.9
3,1.94,344.5
4,2.61,323.4
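
Element-wise operations also work between two series, not only between a series and a number. The sketch below divides one column by another row by row. The `cars` frame and the `price_per_hp` name are made up for illustration. A missing value in either column gives a missing value in the result.

```python
import pandas as pd

# Hypothetical three-row frame; None becomes NaN in a float column
cars = pd.DataFrame({
    "Engine HP": [138.0, None, 218.0],
    "MSRP": [2000, 27150, 54990],
})

# Row-by-row division of one column by another
price_per_hp = cars["MSRP"] / cars["Engine HP"]

print(price_per_hp.round(2))
```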


In [56]:
df['Year']

0    1991
1    2017
2    2010
3    2017
4    2017
Name: Year, dtype: int64

Logical operations

In [58]:
df['Year'] >= 2015

0    False
1     True
2    False
3     True
4     True
Name: Year, dtype: bool

## Filtering

For example, to look at all records where Year is greater than or equal to 2015, we can apply a filter as follows. It will display all records where the condition is True.

dataframe[condition]

In [61]:
df[df['Year'] >= 2015]

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
1,Hyundai,Sonata,2017,,4,AUTOMATIC,Sedan,27150
3,GMC,Acadia,2017,194.0,4,AUTOMATIC,4dr SUV,34450
4,Nissan,Frontier,2017,261.0,6,MANUAL,Pickup,32340


In [62]:
df [
    df['Make'] == 'Nissan'
]

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
0,Nissan,Stanza,1991,138.0,4,MANUAL,sedan,2000
4,Nissan,Frontier,2017,261.0,6,MANUAL,Pickup,32340


In [63]:
df [
    (df['Year'] >= 2015) & (df['Make'] == 'Nissan')
]

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
4,Nissan,Frontier,2017,261.0,6,MANUAL,Pickup,32340
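
Conditions can also be combined with `|` (or) and negated with `~` (not). Each comparison needs its own parentheses when written inline, because `&` and `|` bind more tightly than `>=` or `==`. The Python keywords `and`, `or` and `not` do not work on a series. Here is a small sketch on a reduced copy of the data (the `cars` frame and the mask names are assumptions for illustration).

```python
import pandas as pd

cars = pd.DataFrame({
    "Make": ["Nissan", "Hyundai", "Lotus", "GMC", "Nissan"],
    "Year": [1991, 2017, 2010, 2017, 2017],
})

# Store each boolean mask in a variable to keep the filter readable
is_recent = cars["Year"] >= 2015
is_nissan = cars["Make"] == "Nissan"

recent_or_nissan = cars[is_recent | is_nissan]   # either condition holds
not_nissan = cars[~is_nissan]                     # condition does not hold

print(recent_or_nissan)
print(not_nissan)
```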


## String operations

Unlike NumPy, which is mainly used for numeric operations, pandas also provides string operations

In [64]:
df['Vehicle_Style']

0          sedan
1          Sedan
2    convertible
3        4dr SUV
4         Pickup
Name: Vehicle_Style, dtype: object

The str accessor of a pandas series gives access to the standard string methods, such as lower (to convert all upper case to lower case), replace (to replace a substring with another string) and so on. Each method is applied to every element of the series.

In [68]:
df['Vehicle_Style'].str.lower()

0          sedan
1          sedan
2    convertible
3        4dr suv
4         pickup
Name: Vehicle_Style, dtype: object

In [72]:
df['Vehicle_Style'].str.replace(' ', '_')

0          sedan
1          Sedan
2    convertible
3        4dr_SUV
4         Pickup
Name: Vehicle_Style, dtype: object

In [71]:
df['Vehicle_Style'].str.lower().str.replace(' ', '_')

0          sedan
1          sedan
2    convertible
3        4dr_suv
4         pickup
Name: Vehicle_Style, dtype: object

In [73]:
df['Vehicle_Style'] = df['Vehicle_Style'].str.lower().str.replace(' ', '_')
df

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
0,Nissan,Stanza,1991,138.0,4,MANUAL,sedan,2000
1,Hyundai,Sonata,2017,,4,AUTOMATIC,sedan,27150
2,Lotus,Elise,2010,218.0,4,MANUAL,convertible,54990
3,GMC,Acadia,2017,194.0,4,AUTOMATIC,4dr_suv,34450
4,Nissan,Frontier,2017,261.0,6,MANUAL,pickup,32340
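
The same `str` accessor is available on the column names, so the names themselves can be normalized in the same way. This is a sketch on an empty frame that only carries column names (the `cars` name is an assumption for illustration).

```python
import pandas as pd

# Empty hypothetical frame: only the column names matter here
cars = pd.DataFrame(
    columns=["Make", "Engine HP", "Transmission Type", "Vehicle_Style"]
)

# Lowercase every column name and replace spaces with underscores
cars.columns = cars.columns.str.lower().str.replace(" ", "_")

print(list(cars.columns))
```

After this, every column can be reached with dot notation, since no name contains a space.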


## Summarizing operations

Operations on columns with numerical values. These are called numerical columns.

In [75]:
print(df['MSRP'].min())
print(df['MSRP'].max())
print(df['MSRP'].mean())

2000
54990
30186.0


In [76]:
df['MSRP'].describe()

count        5.000000
mean     30186.000000
std      18985.044904
min       2000.000000
25%      27150.000000
50%      32340.000000
75%      34450.000000
max      54990.000000
Name: MSRP, dtype: float64

In [77]:
df.describe()

Unnamed: 0,Year,Engine HP,Engine Cylinders,MSRP
count,5.0,4.0,5.0,5.0
mean,2010.4,202.75,4.4,30186.0
std,11.260551,51.29896,0.894427,18985.044904
min,1991.0,138.0,4.0,2000.0
25%,2010.0,180.0,4.0,27150.0
50%,2017.0,206.0,4.0,32340.0
75%,2017.0,228.75,4.0,34450.0
max,2017.0,261.0,6.0,54990.0


In [78]:
df.describe().round()

Unnamed: 0,Year,Engine HP,Engine Cylinders,MSRP
count,5.0,4.0,5.0,5.0
mean,2010.0,203.0,4.0,30186.0
std,11.0,51.0,1.0,18985.0
min,1991.0,138.0,4.0,2000.0
25%,2010.0,180.0,4.0,27150.0
50%,2017.0,206.0,4.0,32340.0
75%,2017.0,229.0,4.0,34450.0
max,2017.0,261.0,6.0,54990.0


Operations on columns with string values. These are called categorical columns.

In [79]:
df['Make']

0     Nissan
1    Hyundai
2      Lotus
3        GMC
4     Nissan
Name: Make, dtype: object

Find number of unique values

In [81]:
df['Make'].nunique()

4

In [82]:
df.nunique()

Make                 4
Model                5
Year                 3
Engine HP            4
Engine Cylinders     2
Transmission Type    2
Vehicle_Style        4
MSRP                 5
dtype: int64
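
Besides the number of unique values, it is often useful to see the values themselves and how often each one occurs. A minimal sketch using `unique` and `value_counts` on a stand-in series (the `makes` name is an assumption for illustration):

```python
import pandas as pd

makes = pd.Series(["Nissan", "Hyundai", "Lotus", "GMC", "Nissan"], name="Make")

# Distinct values, in order of first appearance
distinct_makes = makes.unique()

# Number of rows per distinct value, most frequent first
make_counts = makes.value_counts()

print(distinct_makes)
print(make_counts)
```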

## Missing values

Typically when working with machine learning, we do not want any missing values in the data. And if there are any, we need to process/transform them.

In [83]:
df['Engine HP']

0    138.0
1      NaN
2    218.0
3    194.0
4    261.0
Name: Engine HP, dtype: float64

In [84]:
df['Engine HP'].isnull()

0    False
1     True
2    False
3    False
4    False
Name: Engine HP, dtype: bool

In [85]:
df.isnull()

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
0,False,False,False,False,False,False,False,False
1,False,False,False,True,False,False,False,False
2,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False


In [87]:
df.isnull().sum()

Make                 0
Model                0
Year                 0
Engine HP            1
Engine Cylinders     0
Transmission Type    0
Vehicle_Style        0
MSRP                 0
dtype: int64

Fill missing values with 0. In ML, filling with zero is usually not preferred, since it changes the statistics of the data (e.g. the mean is now different).

In [89]:
df['Engine HP'].fillna(0)

0    138.0
1      0.0
2    218.0
3    194.0
4    261.0
Name: Engine HP, dtype: float64

Typically in ML, we fill missing values with the mean value, so that the overall mean of the data stays the same. But the right choice depends on the case.

In [92]:
df['Engine HP'].fillna(df['Engine HP'].mean())

0    138.00
1    202.75
2    218.00
3    194.00
4    261.00
Name: Engine HP, dtype: float64
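
The effect of the two fill strategies on the mean can be checked directly. The sketch below uses a stand-in series with the same values as the Engine HP column (the `hp` name is an assumption for illustration). It also shows that `fillna` returns a new series and leaves the original untouched, so the result must be assigned back to keep it.

```python
import pandas as pd

hp = pd.Series([138.0, None, 218.0, 194.0, 261.0], name="Engine HP")

# mean() skips missing values by default
original_mean = hp.mean()

zero_filled = hp.fillna(0)
mean_filled = hp.fillna(original_mean)

print(original_mean)
print(zero_filled.mean())    # lower than the original mean
print(mean_filled.mean())    # same as the original mean
print(hp.isnull().sum())     # hp itself still has its missing value
```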

You can drop the data having missing values. Again, this is a conscious decision to be taken considering the effect of dropping data like this: whether there is enough data to afford throwing some away, whether removing data in this way affects the distribution of the data (for example, whether it leaves some kinds of samples over-represented compared to others) and so on.

Drop every row that has any missing value. This can be done using axis=0 (which is the default behavior). In df.dropna, axis=0 means drop rows and axis=1 means drop columns. NumPy numbers the axes of a 2D array the same way (axis 0 is the row axis, axis 1 is the column axis). Note that for reductions such as sum, axis=0 collapses the rows and yields one result per column.

In [95]:
df.dropna(axis=0)

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
0,Nissan,Stanza,1991,138.0,4,MANUAL,sedan,2000
2,Lotus,Elise,2010,218.0,4,MANUAL,convertible,54990
3,GMC,Acadia,2017,194.0,4,AUTOMATIC,4dr_suv,34450
4,Nissan,Frontier,2017,261.0,6,MANUAL,pickup,32340


In [96]:
df.dropna(axis=1)

Unnamed: 0,Make,Model,Year,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
0,Nissan,Stanza,1991,4,MANUAL,sedan,2000
1,Hyundai,Sonata,2017,4,AUTOMATIC,sedan,27150
2,Lotus,Elise,2010,4,MANUAL,convertible,54990
3,GMC,Acadia,2017,4,AUTOMATIC,4dr_suv,34450
4,Nissan,Frontier,2017,6,MANUAL,pickup,32340


Rows/columns can be removed either only when the entire row/column is missing or as soon as there is a single missing value. Control this using how='all' or how='any' (the default is how='any').

In [98]:
df.dropna(how='all')

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
0,Nissan,Stanza,1991,138.0,4,MANUAL,sedan,2000
1,Hyundai,Sonata,2017,,4,AUTOMATIC,sedan,27150
2,Lotus,Elise,2010,218.0,4,MANUAL,convertible,54990
3,GMC,Acadia,2017,194.0,4,AUTOMATIC,4dr_suv,34450
4,Nissan,Frontier,2017,261.0,6,MANUAL,pickup,32340


In [99]:
df.dropna(how='any')

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
0,Nissan,Stanza,1991,138.0,4,MANUAL,sedan,2000
2,Lotus,Elise,2010,218.0,4,MANUAL,convertible,54990
3,GMC,Acadia,2017,194.0,4,AUTOMATIC,4dr_suv,34450
4,Nissan,Frontier,2017,261.0,6,MANUAL,pickup,32340
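
In the data above no row is entirely missing, so how='all' drops nothing. The difference shows up with a frame that has one fully missing row. This is a sketch on made-up data (the `frame` name and its values are assumptions for illustration). It also uses the `subset` parameter of `dropna`, which restricts the check to the listed columns.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "Engine HP": [138.0, np.nan, np.nan],
    "MSRP": [2000.0, 27150.0, np.nan],
})

drop_any = frame.dropna(how="any")            # drops rows 1 and 2
drop_all = frame.dropna(how="all")            # drops only row 2 (entirely missing)
drop_subset = frame.dropna(subset=["MSRP"])   # only checks the MSRP column

print(drop_any)
print(drop_all)
print(drop_subset)
```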


## Grouping

In [104]:
df.groupby('Transmission Type').MSRP.mean()

Transmission Type
AUTOMATIC    30800.000000
MANUAL       29776.666667
Name: MSRP, dtype: float64
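
groupby splits the rows by the values of one column and then applies a summary to each group. Several summaries can be computed at once with `agg`. Here is a minimal sketch on a reduced copy of the data (the `cars` and `summary` names are assumptions for illustration).

```python
import pandas as pd

cars = pd.DataFrame({
    "Transmission Type": ["MANUAL", "AUTOMATIC", "MANUAL", "AUTOMATIC", "MANUAL"],
    "MSRP": [2000, 27150, 54990, 34450, 32340],
})

# One row per transmission type, one column per summary statistic
summary = cars.groupby("Transmission Type")["MSRP"].agg(["min", "max", "mean"])

print(summary)
```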

## Get the underlying Numpy array

In [105]:
df['MSRP']

0     2000
1    27150
2    54990
3    34450
4    32340
Name: MSRP, dtype: int64

In [106]:
df['MSRP'].values

array([ 2000, 27150, 54990, 34450, 32340])

In [108]:
np.log1p(df['MSRP'].values)

array([ 7.60140233, 10.20916916, 10.91492481, 10.4472933 , 10.38409105])
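
`to_numpy()` is another way to get the underlying array from a series. The sketch below, on a stand-in series with the MSRP values (the `prices` name is an assumption for illustration), also shows that `np.expm1` undoes `np.log1p`, which is useful for converting log-transformed values back to the original scale.

```python
import numpy as np
import pandas as pd

prices = pd.Series([2000, 27150, 54990, 34450, 32340], name="MSRP")

price_array = prices.to_numpy()      # NumPy array holding the same values

log_prices = np.log1p(price_array)   # log(1 + x), element-wise
restored = np.expm1(log_prices)      # exp(x) - 1, the inverse of log1p

print(type(price_array))
print(np.allclose(restored, price_array))
```

NumPy functions such as `np.log1p` can also be applied to the series directly, in which case the result is a series with the same index.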

Get the list-of-dictionaries representation of the dataframe data. (Remember that we started with this format to populate the dataframe; we can convert back to it.)

In [107]:
df.to_dict(orient='records')

[{'Make': 'Nissan',
  'Model': 'Stanza',
  'Year': 1991,
  'Engine HP': 138.0,
  'Engine Cylinders': 4,
  'Transmission Type': 'MANUAL',
  'Vehicle_Style': 'sedan',
  'MSRP': 2000},
 {'Make': 'Hyundai',
  'Model': 'Sonata',
  'Year': 2017,
  'Engine HP': nan,
  'Engine Cylinders': 4,
  'Transmission Type': 'AUTOMATIC',
  'Vehicle_Style': 'sedan',
  'MSRP': 27150},
 {'Make': 'Lotus',
  'Model': 'Elise',
  'Year': 2010,
  'Engine HP': 218.0,
  'Engine Cylinders': 4,
  'Transmission Type': 'MANUAL',
  'Vehicle_Style': 'convertible',
  'MSRP': 54990},
 {'Make': 'GMC',
  'Model': 'Acadia',
  'Year': 2017,
  'Engine HP': 194.0,
  'Engine Cylinders': 4,
  'Transmission Type': 'AUTOMATIC',
  'Vehicle_Style': '4dr_suv',
  'MSRP': 34450},
 {'Make': 'Nissan',
  'Model': 'Frontier',
  'Year': 2017,
  'Engine HP': 261.0,
  'Engine Cylinders': 6,
  'Transmission Type': 'MANUAL',
  'Vehicle_Style': 'pickup',
  'MSRP': 32340}]

To add

From https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/appendix-d-pandas.ipynb
* Splitting data
* dtypes
* sorting and reordering
* agg