Notes:

Section 1.1 - Introduction (https://www.slideshare.net/AlexeyGrigorev/ml-zoomcamp-11-introduction-to-machine-learning)
* Features - what we know
* Target - what we want to predict
* Features + Target (-> train) = Model
* Features (-> model) = predictions
* Model - the output of training; it encapsulates the patterns learned from the data

Spam code:
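```
SPAM, GOOD = 'spam', 'good'  # labels (assumed values; the original leaves them undefined)

def contains(text, words):  # assumed helper: does any of the words occur in the text?
    return any(word in text.lower() for word in words)

def domain(sender, name):  # assumed helper: does the address belong to the domain?
    return sender.lower().endswith('@' + name)

def detect_spam(email):  # email: any object with .sender, .title and .body strings
    if email.sender == 'promotions@online.com':
        return SPAM
    if contains(email.title, ['tax', 'review']) and domain(email.sender, 'online.com'):
        return SPAM
    if contains(email.body, ['deposit']):
        return SPAM
    return GOOD
```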

Section 1.2 - ML vs Rule-based Systems (https://www.slideshare.net/AlexeyGrigorev/ml-zoomcamp-12-ml-vs-rulebased-systems)
* Rule-based systems: Data + Code -> Software -> Outcome
* ML: Data + Outcome -> ML -> Model (then Data + Model -> Outcome)
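
As a rough illustration of the ML line above (Data + Outcome -> Model), the sketch below "learns" spam words from a handful of labeled emails instead of hard-coding them as rules. The example emails, the `train` and `predict` names, and the word-counting approach are all assumptions made for this illustration; it is deliberately naive and not the method used in the course.

```python
from collections import Counter

# Hypothetical labeled data: (email text, outcome), where 1 = spam and 0 = good.
emails = [
    ("claim your deposit now", 1),
    ("tax review required deposit", 1),
    ("meeting notes attached", 0),
    ("lunch tomorrow", 0),
]

def train(examples):
    """Data + Outcome -> Model: collect the words that occur only in spam."""
    spam_words, good_words = Counter(), Counter()
    for text, label in examples:
        (spam_words if label == 1 else good_words).update(text.split())
    return {word for word in spam_words if word not in good_words}

def predict(model, text):
    """Data + Model -> Outcome: flag the text if it contains a learned spam word."""
    return int(any(word in model for word in text.split()))

model = train(emails)
print(predict(model, "please send deposit"))  # flagged because of "deposit"
print(predict(model, "lunch meeting"))        # contains no learned spam word
```

The point is only the direction of the arrows: the rules are derived from the data, so adding new labeled emails changes the model without rewriting code.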

Section 1.3 - Supervised Machine Learning (https://www.slideshare.net/AlexeyGrigorev/ml-zoomcamp-13-supervised-machine-learning)
* X - features: the feature matrix, a two-dimensional array
* y - target: a one-dimensional array
* g - model: a function that takes the matrix X as input
* g(X) ~ y

In Supervised Machine Learning (SML), every example has a label (target) associated with its features. The model is trained on these labeled examples, and it can then make predictions for new, unseen features.

* Feature matrix (X): made of observations or objects (rows) and features (columns).
* Target variable (y): a vector with the target information we want to predict. For each row of X there's a value in y.
* The model can be represented as a function g that takes the matrix X as input and tries to predict values as close as possible to the targets y. Obtaining the function g is called training.

Types of SML problems:
* Regression: the output is a number (e.g. a car's price).
* Classification: the output is a category (e.g. spam or not spam).
    * Binary: there are two categories.
    * Multiclass: there are more than two categories.
* Ranking: the output is a score for each item, and the items are ordered by score. It is applied in recommender systems.

In summary, SML is about teaching the model by showing different examples, and the goal is to come up with a function that takes the feature matrix as a parameter and makes predictions as close as possible to the y targets.
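
To make g(X) ~ y concrete, here is a minimal sketch of a regression model trained with NumPy least squares. The toy data and the linear form of `g` are assumptions chosen for illustration; the notes do not prescribe a particular model.

```python
import numpy as np

# Toy feature matrix X: 5 observations (rows), 2 features (columns).
X = np.array([
    [1.0, 2.0],
    [2.0, 0.0],
    [3.0, 1.0],
    [4.0, 3.0],
    [5.0, 2.0],
])

# Target vector y: one value per row of X (generated here as 3*x1 + 2*x2 + 1).
y = 3 * X[:, 0] + 2 * X[:, 1] + 1

# Training: find weights w such that X_b @ w is as close as possible to y.
X_b = np.column_stack([np.ones(len(X)), X])  # prepend a column of ones for the bias
w, *_ = np.linalg.lstsq(X_b, y, rcond=None)

def g(X_new):
    """The trained model: maps a feature matrix to predictions."""
    X_new_b = np.column_stack([np.ones(len(X_new)), X_new])
    return X_new_b @ w

y_pred = g(X)
print(np.round(w, 3))       # learned bias and weights
print(np.round(y_pred, 3))  # predictions, to be compared with y
```

Because the toy target is exactly linear in the features, the predictions match y here; with real data, g(X) is only approximately equal to y.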

Section 1.4 - CRISP-DM (https://www.slideshare.net/AlexeyGrigorev/ml-zoomcamp-14-crispdm)

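CRISP-DM (Cross-Industry Standard Process for Data Mining) is a methodology for organizing ML projects. It was developed in the 1990s by an industry consortium and is now closely associated with IBM. The steps of this process are: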

1. Business understanding: Decide whether the project actually needs ML. The goal of the project has to be measurable.
2. Data understanding: Analyze the available data sources and decide whether more data is required.
3. Data preparation: Clean the data and remove noise, typically with pipelines, and convert the data to a tabular format that can be fed to an ML model.
4. Modeling: Train different models and choose the best one. Depending on the results of this step, we may decide to add new features or fix data issues.
5. Evaluation: Measure how well the model performs and whether it solves the business problem.
6. Deployment: Roll the model out to production for all users. Evaluation and deployment often happen together (online evaluation).
7. Iterate: Start simple, learn from the feedback, improve.

It is also important to consider how maintainable the project is.

In general, ML projects require many iterations.

Iteration:
* Start simple
* Learn from the feedback
* Improve

Section 1.5 - Model Selection Process (https://www.slideshare.net/AlexeyGrigorev/ml-zoomcamp-15-model-selection-process)

The validation dataset is not used in training. There are feature matrices and y vectors for both training and validation datasets. The model is fitted with training data, and it is used to predict the y values of the validation feature matrix. Then, the predicted y values (probabilities) are compared with the actual y values.

Multiple comparisons problem (MCP): when many models are compared on the same validation set, one of them can get lucky and obtain good predictions just by chance, because all of them are probabilistic.

The test set helps to guard against the MCP. The best model is selected using the training and validation datasets, while the test dataset is used to confirm that the selected model really performs well.
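
The effect can be simulated with a small sketch. The assumptions here are mine: 100 "models" that predict by coin flip, 50 validation examples, and 1000 test examples. None of the models has any real skill, yet the best one on validation looks better than chance.

```python
import numpy as np

rng = np.random.default_rng(0)
n_val, n_test, n_models = 50, 1000, 100

# True labels are random 0/1, so no model can genuinely beat 50% accuracy.
y_val = rng.integers(0, 2, size=n_val)
y_test = rng.integers(0, 2, size=n_test)

# Each "model" is a coin flip: its predictions are unrelated to the labels.
val_preds = rng.integers(0, 2, size=(n_models, n_val))
test_preds = rng.integers(0, 2, size=(n_models, n_test))

val_acc = (val_preds == y_val).mean(axis=1)     # accuracy of each model on validation
test_acc = (test_preds == y_test).mean(axis=1)  # accuracy of each model on test

best = int(val_acc.argmax())  # the "lucky" model
print(f"best validation accuracy: {val_acc[best]:.2f}")
print(f"same model on test:       {test_acc[best]:.2f}")
```

The validation score of the selected model is inflated by the selection itself, while its score on the untouched test set falls back to roughly 0.5.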

1. Split the dataset into training, validation, and test sets.
2. Train the models.
3. Evaluate the models.
4. Select the best model.
5. Apply the best model to the test dataset.
6. Compare the performance metrics of validation and test.
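
The six steps above can be sketched with NumPy alone. Everything specific in this example is an assumption for illustration: the synthetic quadratic data, the 60/20/20 split, polynomial fits as the candidate models, and RMSE as the metric.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
x = rng.uniform(-3, 3, size=n)
y = 0.5 * x**2 - x + rng.normal(0, 0.5, size=n)  # synthetic data with noise

# 1. Split into training (60%), validation (20%), and test (20%).
idx = rng.permutation(n)
n_train, n_val = int(0.6 * n), int(0.2 * n)
train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# 2-3. Train each candidate on the training set, evaluate on validation.
models, val_scores = {}, {}
for degree in [1, 2, 5]:
    coefs = np.polyfit(x[train_idx], y[train_idx], deg=degree)
    models[degree] = coefs
    val_scores[degree] = rmse(y[val_idx], np.polyval(coefs, x[val_idx]))

# 4. Select the model with the lowest validation error.
best_degree = min(val_scores, key=val_scores.get)

# 5-6. Apply it to the test set and compare with the validation metric.
test_score = rmse(y[test_idx], np.polyval(models[best_degree], x[test_idx]))
print(f"best degree: {best_degree}")
print(f"validation RMSE: {val_scores[best_degree]:.3f}, test RMSE: {test_score:.3f}")
```

If the test metric is much worse than the validation metric, the selected model was probably lucky on validation.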

Section 1.7 - Introduction to NumPy (https://www.datacamp.com/community/blog/python-numpy-cheat-sheet)

Plan:
* Creating arrays
* Multi-dimensional array
* Randomly generated arrays
* Element-wise operations
    * Comparison operations
    * Logical operations
* Summarizing operations

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

### Creating arrays

In [2]:
np.zeros(5)

array([0., 0., 0., 0., 0.])

In [3]:
np.ones(10)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [4]:
np.full(10, 2.5)

array([2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5])

In [5]:
np.array([1, 2, 3, 5, 7, 12])

array([ 1,  2,  3,  5,  7, 12])

In [6]:
a = np.array([1, 2, 3, 5, 7, 12])
a

array([ 1,  2,  3,  5,  7, 12])

In [7]:
a[2]

3

In [8]:
a[2] = 10

In [9]:
a

array([ 1,  2, 10,  5,  7, 12])

In [10]:
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [11]:
np.arange(3, 10)

array([3, 4, 5, 6, 7, 8, 9])

In [12]:
np.linspace(0, 1, 11)

array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])

In [13]:
np.linspace(0, 100, 11)

array([  0.,  10.,  20.,  30.,  40.,  50.,  60.,  70.,  80.,  90., 100.])

### Multi-dimensional arrays

In [14]:
np.zeros((5, 2))

array([[0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.]])

In [15]:
n = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])

In [16]:
n[2, 1]

8

In [17]:
n[0, 1] = 20

In [18]:
n

array([[ 1, 20,  3],
       [ 4,  5,  6],
       [ 7,  8,  9]])

In [19]:
n[0]

array([ 1, 20,  3])

In [20]:
n[2] = [1, 1, 1]

In [21]:
n

array([[ 1, 20,  3],
       [ 4,  5,  6],
       [ 1,  1,  1]])

In [22]:
n[:, 1]

array([20,  5,  1])

In [23]:
n[:, 2] = [0, 1, 2]

In [24]:
n

array([[ 1, 20,  0],
       [ 4,  5,  1],
       [ 1,  1,  2]])

### Randomly generated arrays

In [25]:
np.random.seed(2)
np.random.rand(5, 2)

array([[0.4359949 , 0.02592623],
       [0.54966248, 0.43532239],
       [0.4203678 , 0.33033482],
       [0.20464863, 0.61927097],
       [0.29965467, 0.26682728]])

In [26]:
np.random.seed(3)
np.random.randn(5, 2)

array([[ 1.78862847,  0.43650985],
       [ 0.09649747, -1.8634927 ],
       [-0.2773882 , -0.35475898],
       [-0.08274148, -0.62700068],
       [-0.04381817, -0.47721803]])

In [27]:
np.random.randint(low=0, high=100, size=(5, 2))

array([[26, 81],
       [90, 22],
       [66,  2],
       [63, 60],
       [ 1, 51]])

### Element-wise operations

In [28]:
a = np.arange(5)
a

array([0, 1, 2, 3, 4])

In [29]:
a + 1

array([1, 2, 3, 4, 5])

In [30]:
a * 100

array([  0, 100, 200, 300, 400])

In [31]:
a / 100

array([0.  , 0.01, 0.02, 0.03, 0.04])

In [32]:
10 + (a * 2)

array([10, 12, 14, 16, 18])

In [33]:
b = (10 + (a * 2)) ** 2 / 100

In [34]:
b

array([1.  , 1.44, 1.96, 2.56, 3.24])

In [35]:
a + b

array([1.  , 2.44, 3.96, 5.56, 7.24])

### Comparison operations

In [36]:
a

array([0, 1, 2, 3, 4])

In [37]:
a >= 2

array([False, False,  True,  True,  True])

In [38]:
b

array([1.  , 1.44, 1.96, 2.56, 3.24])

In [39]:
a > b

array([False, False,  True,  True,  True])

In [40]:
a[a > b]

array([2, 3, 4])
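
The plan above also lists logical operations. A short sketch, reusing the same `a` and `b` as in the cells above, of how boolean arrays can be combined element-wise with `&` (and), `|` (or), and `~` (not):

```python
import numpy as np

a = np.arange(5)                # array([0, 1, 2, 3, 4])
b = (10 + (a * 2)) ** 2 / 100   # array([1.  , 1.44, 1.96, 2.56, 3.24])

# Parentheses are required: & and | bind more tightly than comparisons.
both = (a > b) & (a > 2)        # True where both conditions hold
either = (a > b) | (a < 1)      # True where at least one condition holds
negated = ~(a > b)              # element-wise negation

print(both)      # [False False False  True  True]
print(either)    # [ True False  True  True  True]
print(negated)   # [ True  True False False False]
print(a[both])   # [3 4]
```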

### Summarizing operations

In [41]:
a.min()

0

In [42]:
a.max()

4

In [43]:
a.sum()

10

In [44]:
a.mean()

2.0

In [45]:
a.std()

1.4142135623730951

In [46]:
n.min()

0
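
For two-dimensional arrays, the same summarizing methods accept an `axis` argument. This goes slightly beyond the cells above; the sketch uses the final state of the matrix `n` from the multi-dimensional section.

```python
import numpy as np

n = np.array([
    [1, 20, 0],
    [4, 5, 1],
    [1, 1, 2],
])

col_sums = n.sum(axis=0)   # collapse the rows: one value per column
row_sums = n.sum(axis=1)   # collapse the columns: one value per row
row_max = n.max(axis=1)    # largest value in each row

print(col_sums)  # [ 6 26  3]
print(row_sums)  # [21 10  4]
print(row_max)   # [20  5  2]
```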

Section 1.8 - Linear Algebra Refresher (https://www.slideshare.net/AlexeyGrigorev/ml-zoomcamp-18-linear-algebra-refresher)

Plan:
* Vector operations
* Multiplication
* Matrix inverse

In [47]:
u = np.array([2, 4, 5, 6])

In [48]:
u

array([2, 4, 5, 6])

In [49]:
v = np.array([1, 0, 0, 2])

In [50]:
v

array([1, 0, 0, 2])

In [51]:
u + v

array([3, 4, 5, 8])

In [52]:
u * v

array([ 2,  0,  0, 12])

In [53]:
v.shape

(4,)

In [54]:
u.shape

(4,)

### Multiplication

In [55]:
def vector_vector_multiplication(u, v):
    assert u.shape[0] == v.shape[0]
    
    n = u.shape[0]
    
    result = 0.0
    
    for i in range(n):
        result = result + u[i] * v[i]
        
    return result

In [56]:
vector_vector_multiplication(u, v)

14.0

In [57]:
u.dot(v)

14

In [58]:
U = np.array([
    [2, 4, 5, 6],
    [1, 2, 1, 2],
    [3, 1, 2, 1]
])

In [59]:
U

array([[2, 4, 5, 6],
       [1, 2, 1, 2],
       [3, 1, 2, 1]])

In [60]:
def matrix_vector_multiplication(U, v):
    assert U.shape[1] == v.shape[0]
    
    num_rows = U.shape[0]
    
    result = np.zeros(num_rows)
    
    for i in range(num_rows):
        result[i] = vector_vector_multiplication(U[i], v)
        
    return result

In [61]:
matrix_vector_multiplication(U, v)

array([14.,  5.,  5.])

In [62]:
U.dot(v)

array([14,  5,  5])

In [63]:
V = np.array([
    [1, 1, 2],
    [0, 0.5, 1],
    [0, 2, 1],
    [2, 1, 0]
])

In [64]:
def matrix_matrix_multiplication(U, V):
    assert U.shape[1] == V.shape[0]
    
    num_rows = U.shape[0]
    num_cols = V.shape[1]
    
    result = np.zeros((num_rows, num_cols))
    
    for i in range(num_cols):
        vi = V[:, i]
        Uvi = matrix_vector_multiplication(U, vi)
        result[:, i] = Uvi
        
    return result

In [65]:
matrix_matrix_multiplication(U, V)

array([[14. , 20. , 13. ],
       [ 5. ,  6. ,  5. ],
       [ 5. ,  8.5,  9. ]])

In [66]:
U.dot(V)

array([[14. , 20. , 13. ],
       [ 5. ,  6. ,  5. ],
       [ 5. ,  8.5,  9. ]])
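
As a side note, the `@` operator computes the same matrix product as `.dot` here. The sketch below also shows what happens when the inner dimensions do not match, which is the condition the `assert` statements in the functions above check.

```python
import numpy as np

U = np.array([
    [2, 4, 5, 6],
    [1, 2, 1, 2],
    [3, 1, 2, 1],
])
V = np.array([
    [1, 1, 2],
    [0, 0.5, 1],
    [0, 2, 1],
    [2, 1, 0],
])

UV = U @ V        # (3, 4) @ (4, 3) -> (3, 3), same result as U.dot(V)
print(UV)

# (3, 4) @ (3, 4) is not defined: the inner dimensions 4 and 3 differ.
try:
    U @ U
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)  # True
```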

### Identity matrix

In [67]:
I = np.eye(3)
I

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [68]:
V.dot(I)

array([[1. , 1. , 2. ],
       [0. , 0.5, 1. ],
       [0. , 2. , 1. ],
       [2. , 1. , 0. ]])

### Inverse

In [69]:
Vs = V[[0, 1, 2]]
Vs

array([[1. , 1. , 2. ],
       [0. , 0.5, 1. ],
       [0. , 2. , 1. ]])

In [70]:
Vs_inv = np.linalg.inv(Vs)
Vs_inv

array([[ 1.        , -2.        ,  0.        ],
       [ 0.        , -0.66666667,  0.66666667],
       [ 0.        ,  1.33333333, -0.33333333]])

In [71]:
Vs_inv.dot(Vs)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
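
A few practical points about inverses, as a sketch: results are floating-point, so it is safer to compare with `np.allclose` than with `==`; a linear system can be solved with `np.linalg.solve` without forming the inverse; and a singular matrix has no inverse. The vector `b` and the singular matrix are made-up examples.

```python
import numpy as np

Vs = np.array([
    [1, 1, 2],
    [0, 0.5, 1],
    [0, 2, 1],
])
Vs_inv = np.linalg.inv(Vs)

# Compare with a tolerance: the product is the identity only up to rounding.
is_identity = np.allclose(Vs_inv @ Vs, np.eye(3))

# Solve Vs x = b in two ways; both should give the same x.
b = np.array([1.0, 2.0, 3.0])
x_solve = np.linalg.solve(Vs, b)
x_inv = Vs_inv @ b

# A matrix with linearly dependent rows cannot be inverted.
singular = np.array([[1.0, 2.0], [2.0, 4.0]])
try:
    np.linalg.inv(singular)
    singular_raised = False
except np.linalg.LinAlgError:
    singular_raised = True

print(is_identity, np.allclose(x_solve, x_inv), singular_raised)
```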

Section 1.9 - Introduction to Pandas (https://www.datacamp.com/community/blog/python-pandas-cheat-sheet)

Plan:
* Data Frames
* Series
* Index
* Accessing elements
* Element-wise operations
* Filtering
* String operations
* Summarizing operations
* Missing values
* Grouping
* Getting the NumPy arrays

### Data frames

In [72]:
data = [
    ['Nissan', 'Stanza', 1991, 138, 4, 'MANUAL', 'sedan', 2000],
    ['Hyundai', 'Sonata', 2017, None, 4, 'AUTOMATIC', 'Sedan', 27150],
    ['Lotus', 'Elise', 2010, 218, 4, 'MANUAL', 'convertible', 54990],
    ['GMC', 'Acadia',  2017, 194, 4, 'AUTOMATIC', '4dr SUV', 34450],
    ['Nissan', 'Frontier', 2017, 261, 6, 'MANUAL', 'Pickup', 32340],
]

columns = [
    'Make', 'Model', 'Year', 'Engine HP', 'Engine Cylinders',
    'Transmission Type', 'Vehicle_Style', 'MSRP'
]

In [73]:
df = pd.DataFrame(data, columns=columns)
df

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
0,Nissan,Stanza,1991,138.0,4,MANUAL,sedan,2000
1,Hyundai,Sonata,2017,,4,AUTOMATIC,Sedan,27150
2,Lotus,Elise,2010,218.0,4,MANUAL,convertible,54990
3,GMC,Acadia,2017,194.0,4,AUTOMATIC,4dr SUV,34450
4,Nissan,Frontier,2017,261.0,6,MANUAL,Pickup,32340


Alternatively, we can use a list of dictionaries to create a dataframe:

In [74]:
data = [
    {
        "Make": "Nissan",
        "Model": "Stanza",
        "Year": 1991,
        "Engine HP": 138.0,
        "Engine Cylinders": 4,
        "Transmission Type": "MANUAL",
        "Vehicle_Style": "sedan",
        "MSRP": 2000
    },
    {
        "Make": "Hyundai",
        "Model": "Sonata",
        "Year": 2017,
        "Engine HP": None,
        "Engine Cylinders": 4,
        "Transmission Type": "AUTOMATIC",
        "Vehicle_Style": "Sedan",
        "MSRP": 27150
    },
    {
        "Make": "Lotus",
        "Model": "Elise",
        "Year": 2010,
        "Engine HP": 218.0,
        "Engine Cylinders": 4,
        "Transmission Type": "MANUAL",
        "Vehicle_Style": "convertible",
        "MSRP": 54990
    },
    {
        "Make": "GMC",
        "Model": "Acadia",
        "Year": 2017,
        "Engine HP": 194.0,
        "Engine Cylinders": 4,
        "Transmission Type": "AUTOMATIC",
        "Vehicle_Style": "4dr SUV",
        "MSRP": 34450
    },
    {
        "Make": "Nissan",
        "Model": "Frontier",
        "Year": 2017,
        "Engine HP": 261.0,
        "Engine Cylinders": 6,
        "Transmission Type": "MANUAL",
        "Vehicle_Style": "Pickup",
        "MSRP": 32340
    }
]

In [75]:
df = pd.DataFrame(data)

In [76]:
df

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
0,Nissan,Stanza,1991,138.0,4,MANUAL,sedan,2000
1,Hyundai,Sonata,2017,,4,AUTOMATIC,Sedan,27150
2,Lotus,Elise,2010,218.0,4,MANUAL,convertible,54990
3,GMC,Acadia,2017,194.0,4,AUTOMATIC,4dr SUV,34450
4,Nissan,Frontier,2017,261.0,6,MANUAL,Pickup,32340


In [77]:
df.head(n=2)

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
0,Nissan,Stanza,1991,138.0,4,MANUAL,sedan,2000
1,Hyundai,Sonata,2017,,4,AUTOMATIC,Sedan,27150


### Series

In [78]:
df

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
0,Nissan,Stanza,1991,138.0,4,MANUAL,sedan,2000
1,Hyundai,Sonata,2017,,4,AUTOMATIC,Sedan,27150
2,Lotus,Elise,2010,218.0,4,MANUAL,convertible,54990
3,GMC,Acadia,2017,194.0,4,AUTOMATIC,4dr SUV,34450
4,Nissan,Frontier,2017,261.0,6,MANUAL,Pickup,32340


In [79]:
df.Make

0     Nissan
1    Hyundai
2      Lotus
3        GMC
4     Nissan
Name: Make, dtype: object

In [80]:
df['Make']

0     Nissan
1    Hyundai
2      Lotus
3        GMC
4     Nissan
Name: Make, dtype: object

In [81]:
df[['Make', 'Model', 'MSRP']]

Unnamed: 0,Make,Model,MSRP
0,Nissan,Stanza,2000
1,Hyundai,Sonata,27150
2,Lotus,Elise,54990
3,GMC,Acadia,34450
4,Nissan,Frontier,32340


In [82]:
df['id'] = [1, 2, 3, 4, 5]
df

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP,id
0,Nissan,Stanza,1991,138.0,4,MANUAL,sedan,2000,1
1,Hyundai,Sonata,2017,,4,AUTOMATIC,Sedan,27150,2
2,Lotus,Elise,2010,218.0,4,MANUAL,convertible,54990,3
3,GMC,Acadia,2017,194.0,4,AUTOMATIC,4dr SUV,34450,4
4,Nissan,Frontier,2017,261.0,6,MANUAL,Pickup,32340,5


In [83]:
df['id'] = [10, 20, 30, 40, 50]
df

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP,id
0,Nissan,Stanza,1991,138.0,4,MANUAL,sedan,2000,10
1,Hyundai,Sonata,2017,,4,AUTOMATIC,Sedan,27150,20
2,Lotus,Elise,2010,218.0,4,MANUAL,convertible,54990,30
3,GMC,Acadia,2017,194.0,4,AUTOMATIC,4dr SUV,34450,40
4,Nissan,Frontier,2017,261.0,6,MANUAL,Pickup,32340,50


In [84]:
del df['id']

In [85]:
df

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
0,Nissan,Stanza,1991,138.0,4,MANUAL,sedan,2000
1,Hyundai,Sonata,2017,,4,AUTOMATIC,Sedan,27150
2,Lotus,Elise,2010,218.0,4,MANUAL,convertible,54990
3,GMC,Acadia,2017,194.0,4,AUTOMATIC,4dr SUV,34450
4,Nissan,Frontier,2017,261.0,6,MANUAL,Pickup,32340


### Index & Accessing Elements

In [86]:
df.index

RangeIndex(start=0, stop=5, step=1)

In [87]:
df.Make.index

RangeIndex(start=0, stop=5, step=1)

In [88]:
df.loc[1]

Make                   Hyundai
Model                   Sonata
Year                      2017
Engine HP                  NaN
Engine Cylinders             4
Transmission Type    AUTOMATIC
Vehicle_Style            Sedan
MSRP                     27150
Name: 1, dtype: object

In [89]:
df.loc[[1, 2]]

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
1,Hyundai,Sonata,2017,,4,AUTOMATIC,Sedan,27150
2,Lotus,Elise,2010,218.0,4,MANUAL,convertible,54990


In [90]:
df.index = ['a', 'b', 'c', 'd', 'e']

In [91]:
df

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
a,Nissan,Stanza,1991,138.0,4,MANUAL,sedan,2000
b,Hyundai,Sonata,2017,,4,AUTOMATIC,Sedan,27150
c,Lotus,Elise,2010,218.0,4,MANUAL,convertible,54990
d,GMC,Acadia,2017,194.0,4,AUTOMATIC,4dr SUV,34450
e,Nissan,Frontier,2017,261.0,6,MANUAL,Pickup,32340


In [92]:
df.loc[['b', 'c']]

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
b,Hyundai,Sonata,2017,,4,AUTOMATIC,Sedan,27150
c,Lotus,Elise,2010,218.0,4,MANUAL,convertible,54990


In [93]:
df.iloc[1]

Make                   Hyundai
Model                   Sonata
Year                      2017
Engine HP                  NaN
Engine Cylinders             4
Transmission Type    AUTOMATIC
Vehicle_Style            Sedan
MSRP                     27150
Name: b, dtype: object

In [94]:
df.iloc[[1, 2, 4]]

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
b,Hyundai,Sonata,2017,,4,AUTOMATIC,Sedan,27150
c,Lotus,Elise,2010,218.0,4,MANUAL,convertible,54990
e,Nissan,Frontier,2017,261.0,6,MANUAL,Pickup,32340


In [95]:
df.reset_index()
df.reset_index(drop=True)

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
0,Nissan,Stanza,1991,138.0,4,MANUAL,sedan,2000
1,Hyundai,Sonata,2017,,4,AUTOMATIC,Sedan,27150
2,Lotus,Elise,2010,218.0,4,MANUAL,convertible,54990
3,GMC,Acadia,2017,194.0,4,AUTOMATIC,4dr SUV,34450
4,Nissan,Frontier,2017,261.0,6,MANUAL,Pickup,32340
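
To summarize the cells above: `loc` selects by index label, `iloc` selects by position, and `reset_index` returns a new DataFrame rather than modifying `df`. A reduced sketch with three rows and two columns (a cut-down version of the data above, for brevity):

```python
import pandas as pd

df = pd.DataFrame(
    {"Make": ["Nissan", "Hyundai", "Lotus"], "MSRP": [2000, 27150, 54990]},
    index=["a", "b", "c"],
)

by_label = df.loc["b"]      # label-based: the row whose index label is 'b'
by_position = df.iloc[1]    # position-based: the second row, whatever its label

# reset_index returns a new DataFrame; df keeps its labels
# unless the result is assigned back to df.
df_reset = df.reset_index(drop=True)

print(by_label["Make"], by_position["Make"])
print(list(df.index), list(df_reset.index))
```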


### Element-wise operations

In [96]:
df['Engine HP'] / 100

a    1.38
b     NaN
c    2.18
d    1.94
e    2.61
Name: Engine HP, dtype: float64

In [97]:
df['Year'] >= 2015

a    False
b     True
c    False
d     True
e     True
Name: Year, dtype: bool

### Filtering

In [98]:
df[df['Year'] >= 2015]

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
b,Hyundai,Sonata,2017,,4,AUTOMATIC,Sedan,27150
d,GMC,Acadia,2017,194.0,4,AUTOMATIC,4dr SUV,34450
e,Nissan,Frontier,2017,261.0,6,MANUAL,Pickup,32340


In [99]:
df[df['Make'] == 'Nissan']

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
a,Nissan,Stanza,1991,138.0,4,MANUAL,sedan,2000
e,Nissan,Frontier,2017,261.0,6,MANUAL,Pickup,32340


In [100]:
df[(df['Make'] == 'Nissan') & (df['Year'] >= 2015)]

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
e,Nissan,Frontier,2017,261.0,6,MANUAL,Pickup,32340


### String operations

In [101]:
df['Vehicle_Style']

a          sedan
b          Sedan
c    convertible
d        4dr SUV
e         Pickup
Name: Vehicle_Style, dtype: object

In [102]:
'STR'.lower()

'str'

In [103]:
df['Vehicle_Style'].str.lower()

a          sedan
b          sedan
c    convertible
d        4dr suv
e         pickup
Name: Vehicle_Style, dtype: object

In [104]:
df['Vehicle_Style'].str.replace(' ', '_')

a          sedan
b          Sedan
c    convertible
d        4dr_SUV
e         Pickup
Name: Vehicle_Style, dtype: object

In [105]:
df['Vehicle_Style'].str.replace(' ', '_').str.lower()

a          sedan
b          sedan
c    convertible
d        4dr_suv
e         pickup
Name: Vehicle_Style, dtype: object

In [106]:
df['Vehicle_Style'] = df['Vehicle_Style'].str.replace(' ', '_').str.lower()

In [107]:
df

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
a,Nissan,Stanza,1991,138.0,4,MANUAL,sedan,2000
b,Hyundai,Sonata,2017,,4,AUTOMATIC,sedan,27150
c,Lotus,Elise,2010,218.0,4,MANUAL,convertible,54990
d,GMC,Acadia,2017,194.0,4,AUTOMATIC,4dr_suv,34450
e,Nissan,Frontier,2017,261.0,6,MANUAL,pickup,32340


### Summarizing operations

In [108]:
df.MSRP.min()

2000

In [109]:
df.MSRP.max()

54990

In [111]:
df.MSRP.describe()

count        5.000000
mean     30186.000000
std      18985.044904
min       2000.000000
25%      27150.000000
50%      32340.000000
75%      34450.000000
max      54990.000000
Name: MSRP, dtype: float64

In [112]:
df.describe()

Unnamed: 0,Year,Engine HP,Engine Cylinders,MSRP
count,5.0,4.0,5.0,5.0
mean,2010.4,202.75,4.4,30186.0
std,11.260551,51.29896,0.894427,18985.044904
min,1991.0,138.0,4.0,2000.0
25%,2010.0,180.0,4.0,27150.0
50%,2017.0,206.0,4.0,32340.0
75%,2017.0,228.75,4.0,34450.0
max,2017.0,261.0,6.0,54990.0


In [113]:
df.describe().round(2)

Unnamed: 0,Year,Engine HP,Engine Cylinders,MSRP
count,5.0,4.0,5.0,5.0
mean,2010.4,202.75,4.4,30186.0
std,11.26,51.3,0.89,18985.04
min,1991.0,138.0,4.0,2000.0
25%,2010.0,180.0,4.0,27150.0
50%,2017.0,206.0,4.0,32340.0
75%,2017.0,228.75,4.0,34450.0
max,2017.0,261.0,6.0,54990.0


In [114]:
df

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
a,Nissan,Stanza,1991,138.0,4,MANUAL,sedan,2000
b,Hyundai,Sonata,2017,,4,AUTOMATIC,sedan,27150
c,Lotus,Elise,2010,218.0,4,MANUAL,convertible,54990
d,GMC,Acadia,2017,194.0,4,AUTOMATIC,4dr_suv,34450
e,Nissan,Frontier,2017,261.0,6,MANUAL,pickup,32340


In [115]:
df.Make

a     Nissan
b    Hyundai
c      Lotus
d        GMC
e     Nissan
Name: Make, dtype: object

In [116]:
df.Make.nunique()

4

In [117]:
df.Make.nunique

<bound method IndexOpsMixin.nunique of a     Nissan
b    Hyundai
c      Lotus
d        GMC
e     Nissan
Name: Make, dtype: object>

In [118]:
df.nunique()

Make                 4
Model                5
Year                 3
Engine HP            4
Engine Cylinders     2
Transmission Type    2
Vehicle_Style        4
MSRP                 5
dtype: int64

### Missing values

In [119]:
df.isnull()

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
a,False,False,False,False,False,False,False,False
b,False,False,False,True,False,False,False,False
c,False,False,False,False,False,False,False,False
d,False,False,False,False,False,False,False,False
e,False,False,False,False,False,False,False,False


In [120]:
df.isnull().sum()

Make                 0
Model                0
Year                 0
Engine HP            1
Engine Cylinders     0
Transmission Type    0
Vehicle_Style        0
MSRP                 0
dtype: int64
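
Once missing values are located, two common options are to fill them or to drop the affected rows. The notes do not say which to use, so this is only a sketch of both, using the mean as one possible fill value and a two-column cut of the data above.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Make": ["Nissan", "Hyundai", "Lotus", "GMC", "Nissan"],
        "Engine HP": [138, None, 218, 194, 261],
    },
    index=["a", "b", "c", "d", "e"],
)

hp = df["Engine HP"]

# Option 1: replace the missing value with the column mean (NaN is skipped by mean()).
hp_filled = hp.fillna(hp.mean())

# Option 2: drop the rows where Engine HP is missing.
df_dropped = df.dropna(subset=["Engine HP"])

print(hp_filled)
print(df_dropped)
```

Both calls return new objects; `df` itself is unchanged unless the result is assigned back.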

### Grouping

In SQL, this grouping would be written as:

```
SELECT transmission_type, AVG(MSRP)
FROM cars
GROUP BY transmission_type
```

In [121]:
df.groupby('Transmission Type').MSRP.mean()

Transmission Type
AUTOMATIC    30800.000000
MANUAL       29776.666667
Name: MSRP, dtype: float64
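
Several statistics per group can be computed in one pass with `agg`. A small sketch on the same two columns; the choice of statistics is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    "Transmission Type": ["MANUAL", "AUTOMATIC", "MANUAL", "AUTOMATIC", "MANUAL"],
    "MSRP": [2000, 27150, 54990, 34450, 32340],
})

# One row per transmission type, one column per statistic.
summary = df.groupby("Transmission Type")["MSRP"].agg(["mean", "min", "max", "count"])
print(summary)
```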

### Getting the NumPy arrays

In [122]:
df.MSRP.values

array([ 2000, 27150, 54990, 34450, 32340])

In [123]:
df

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
a,Nissan,Stanza,1991,138.0,4,MANUAL,sedan,2000
b,Hyundai,Sonata,2017,,4,AUTOMATIC,sedan,27150
c,Lotus,Elise,2010,218.0,4,MANUAL,convertible,54990
d,GMC,Acadia,2017,194.0,4,AUTOMATIC,4dr_suv,34450
e,Nissan,Frontier,2017,261.0,6,MANUAL,pickup,32340


In [124]:
df.to_dict(orient='records')

[{'Make': 'Nissan',
  'Model': 'Stanza',
  'Year': 1991,
  'Engine HP': 138.0,
  'Engine Cylinders': 4,
  'Transmission Type': 'MANUAL',
  'Vehicle_Style': 'sedan',
  'MSRP': 2000},
 {'Make': 'Hyundai',
  'Model': 'Sonata',
  'Year': 2017,
  'Engine HP': nan,
  'Engine Cylinders': 4,
  'Transmission Type': 'AUTOMATIC',
  'Vehicle_Style': 'sedan',
  'MSRP': 27150},
 {'Make': 'Lotus',
  'Model': 'Elise',
  'Year': 2010,
  'Engine HP': 218.0,
  'Engine Cylinders': 4,
  'Transmission Type': 'MANUAL',
  'Vehicle_Style': 'convertible',
  'MSRP': 54990},
 {'Make': 'GMC',
  'Model': 'Acadia',
  'Year': 2017,
  'Engine HP': 194.0,
  'Engine Cylinders': 4,
  'Transmission Type': 'AUTOMATIC',
  'Vehicle_Style': '4dr_suv',
  'MSRP': 34450},
 {'Make': 'Nissan',
  'Model': 'Frontier',
  'Year': 2017,
  'Engine HP': 261.0,
  'Engine Cylinders': 6,
  'Transmission Type': 'MANUAL',
  'Vehicle_Style': 'pickup',
  'MSRP': 32340}]

Section 1.10 - Summary (https://www.slideshare.net/AlexeyGrigorev/ml-zoomcamp-110-summary)