# **Prework: Review**
---
### **Description**
This notebook reviews several key coding topics from Part I so that you can hit the ground running on Day 1 of Part II.

<br>

### **Lab Structure**
**Part 1**: [Exploratory Data Analysis Review](#p1)

  >  **Part 1.1**: [Pandas](#p1.1)
  
  > **Part 1.2**: [Matplotlib](#p1.2)

**Part 2**: [Linear Regression](#p2)

**Part 3**: [KNN](#p3)



<br>


### **Resources**
* [Python Basics Cheat Sheet](https://docs.google.com/document/d/1YcbzHos1139dSq-9DO6zYXyza1WXiBF0YdpYlG8yS1k/edit?usp=drive_link)

* [EDA with pandas Cheat Sheet](https://docs.google.com/document/d/1PDHUzx7dHCa6MZBpHFME8IMW3EFtcUj-FHPhi_p6QEE/edit?usp=drive_link)

* [Data Visualization with matplotlib Cheat Sheet](https://docs.google.com/document/d/1kRBEo82pP5r5OJ1npGbfubUw2z8aXIKonjW4PXgY4x8/edit?usp=drive_link)

* [Linear Regression with sklearn Cheat Sheet](https://docs.google.com/document/d/1uc4fq5cjOoT1ohol9SR0020ThksWKFjMMnXsvXIPIWE/edit?usp=drive_link)

* [K-Nearest Neighbors with sklearn Cheat Sheet](https://docs.google.com/document/d/1775XrkO8W5_cxIkd_gkUXOu1dt8xDtNWPZSixMGBKIw/edit?usp=drive_link)


<br>

**Before starting, run the code below to import all necessary functions and libraries.**


In [None]:
import warnings
warnings.filterwarnings('ignore')

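import numpy as np  # used by the decision boundary visualization in Part 3
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import model_selection
from sklearn import datasets
from sklearn import metrics  # used as metrics.ConfusionMatrixDisplay in Part 3
from sklearn.metrics import *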

<a name="p1"></a>

---
## **Part 1: Exploratory Data Analysis**
---




<a name="p1.1"></a>

---
### **Part 1.1: Pandas**
---


**Run the code cell below to create the DataFrame.**

In [None]:
df = pd.DataFrame({'U.S. State': ['California', 'Florida', 'Indiana', 'Texas', 'Pennsylvania'],
        'Population (in millions)': [38, 21, 6.5, 28, 13],
        'Capital': ['Sacramento', 'Tallahassee', 'Indianapolis', 'Austin', 'Harrisburg'],
        'GDP ($ in billions)': [3700, 1070, 352, 1876, 726]})

#### **Problem #1.1.1**

Inspect what `.head()` tells us about this DataFrame.

#### **Problem #1.1.2**

Determine what datatype `Population (in millions)` is.

#### **Problem #1.1.3**

Print all of the unique values for `GDP ($ in billions)`.

#### **Problem #1.1.4**

Determine the column names in the dataset.

#### **Problem #1.1.5**

Determine the highest `GDP ($ in billions)` in the dataset.

#### **Problem #1.1.6**

Determine which states are included in this dataset.

#### **Problem #1.1.7**

Determine the range of GDP values among the states.
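
<br>

As a refresher for Problems #1.1.1 through #1.1.7, here is a minimal sketch of the relevant pandas inspection calls on a hypothetical toy DataFrame. The name `toy_df` and its columns are made up for illustration and are not part of this lab's data; adapt the column names to `df` yourself.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the states DataFrame above.
toy_df = pd.DataFrame({
    "City": ["Oslo", "Lima", "Cairo", "Lima"],
    "Rainfall (mm)": [763.0, 16.0, 18.0, 16.0],
})

print(toy_df.head(2))                    # first two rows
print(toy_df["Rainfall (mm)"].dtype)     # datatype of one column
print(toy_df["City"].unique())           # distinct values in a column
print(list(toy_df.columns))              # column names
print(toy_df["Rainfall (mm)"].max())     # largest value in a column

# Range = largest value minus smallest value.
rainfall_range = toy_df["Rainfall (mm)"].max() - toy_df["Rainfall (mm)"].min()
print(rainfall_range)
```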

<a name="p1.2"></a>

---
### **Part 1.2: Matplotlib**
---

**Run the cell below to load the data.**

In [None]:
url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vS9jPkeKJ8QUuAl-fFdg3nJPDP6vx1byvIBl4yW8UZZJ9QEscyALJp1eywKeAg7aAffwdKP63D9osF1/pub?gid=169291584&single=true&output=csv"
movie_df = pd.read_csv(url)

movie_df.drop_duplicates(inplace=True)

mean_runtime = movie_df['Runtime'].mean()
movie_df['Runtime'] = movie_df['Runtime'].fillna(mean_runtime)

movie_df = movie_df.rename(columns = {"Runtime": "Runtime (min)"})
movie_df = movie_df.astype({"Runtime (min)": "int64"})

movie_df.head()

#### **Problem #1.2.1**

Create a scatterplot using `Runtime (min)` as the x-axis value and `Gross` as the y-axis value.

Make sure to include a meaningful:
* `Title`: "Gross vs. Runtime"
* `X-axis`: "Runtime (min)"
* `Y-axis`: "Gross (USD)"

#### **Problem #1.2.2**

Create a scatterplot using `Released_Year` as the x-axis value and `Runtime (min)` as the y-axis value.

Make sure to include a meaningful:
* `Title`: "Runtime vs. Released_Year"
* `X-axis`: "Year"
* `Y-axis`: "Runtime (min)"
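
<br>

The labeling pattern asked for in the scatterplot problems above can be sketched as follows. The data and label text here are hypothetical and chosen only for illustration; this is one possible way to label a plot, not the required solution.

```python
import matplotlib.pyplot as plt

# Hypothetical values, made up for illustration only.
hours_studied = [1, 2, 3, 4, 5]
quiz_score = [52, 60, 71, 80, 88]

fig, ax = plt.subplots()
ax.scatter(hours_studied, quiz_score)          # x values first, then y values
ax.set_title("Quiz Score vs. Hours Studied")   # convention: "y vs. x"
ax.set_xlabel("Hours Studied (hr)")            # include units where they apply
ax.set_ylabel("Quiz Score (points)")
plt.show()
```

The same result can be reached with the `plt.scatter`, `plt.title`, `plt.xlabel`, and `plt.ylabel` functions if you prefer not to create `fig` and `ax` explicitly.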

#### **Problem #1.2.3**

Create a line plot using `Runtime (min)` as the x-axis value and `Gross` as the y-axis value.

Make sure to include a meaningful:
* Title, ex: `'Gross Money vs. Runtime'`.
* X-axis label including units `'min'`.
* Y-axis label including units `'USD'`.

<br>

**NOTE**: This is not going to be a particularly helpful graph (the scatter plot is a better choice), but we often will not know this ahead of time. A lot of EDA and visualization work involves trying a number of things and seeing what is useful.

#### **Problem #1.2.4**

Create a line plot using `Released_Year` as the x-axis value and `Average Gross in Year` as the y-axis value.

Make sure to include a meaningful:
* Title, ex: `'Average Gross Money vs. Released Year'`.
* X-axis label.
* Y-axis label including units `'USD'`.
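
<br>

Computing an average per group, as this problem requires, can be sketched on hypothetical toy data. The names `sales_df`, `Year`, and `Revenue` are assumptions made for illustration and do not appear in the movie dataset.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only.
sales_df = pd.DataFrame({
    "Year": [2020, 2020, 2021, 2021, 2021],
    "Revenue": [10.0, 30.0, 20.0, 40.0, 60.0],
})

# One mean per year; the result is a Series indexed by year.
mean_revenue = sales_df.groupby("Year")["Revenue"].mean()
print(mean_revenue)
```

The index of the result (`mean_revenue.index`) and its values (`mean_revenue.values`) can then serve as the x and y arguments of a line plot.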

In [None]:
mean_gross = movie_df.groupby(# COMPLETE THIS LINE


#### **Problem #1.2.5**

Create a bar plot of the number of movies released per year.

Use the provided Series, `movies_per_year`, and make sure to include a meaningful:
* Title.
* X-axis label.
* Y-axis label.

In [None]:
movies_per_year = movie_df['Released_Year'].value_counts()

plt.bar(movies_per_year.index, # COMPLETE THIS CODE

#### **Problem #1.2.6**

Create a bar plot of the number of Dramas released per year.

Build a Series like `movies_per_year` that counts only Dramas, and make sure to include a meaningful:
* Title.
* X-axis label.
* Y-axis label.

<br>

**Hint**: Recall that you can use `.loc[CRITERIA, :]` to find all rows matching given criteria, and see the example in Problem #1.2.5 for finding the number of movies released per year.
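
<br>

As a reminder of how the two pieces in the hint fit together, here is a minimal sketch on hypothetical toy data. The names `pets_df`, `Species`, and `Adopted_Year` are made up for illustration; the column you filter on in `movie_df` will be different.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only.
pets_df = pd.DataFrame({
    "Species": ["cat", "dog", "cat", "cat", "dog"],
    "Adopted_Year": [2021, 2021, 2022, 2021, 2022],
})

# Keep only the rows where the criterion is True, with all columns.
cats_df = pets_df.loc[pets_df["Species"] == "cat", :]

# Count how many cats were adopted in each year, ordered by year.
cats_per_year = cats_df["Adopted_Year"].value_counts().sort_index()
print(cats_per_year)
```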

In [None]:
# COMPLETE THIS CODE

<a name="p2"></a>

---
## **Part 2: Linear Regression**
---

Using the CO2 Emissions dataset, do the following:
* Build a model that will predict the CO2 emissions of a car;
* Predict the CO2 emissions of a car with a specific volume and weight.

<br>

Since 1970, CO2 emissions have increased by nearly 90%. These elevated CO2 levels cause poor air quality and contribute to climate change. Globally, cars and other transportation vehicles are responsible for about 29% of overall CO2 emissions. This CO2 emissions dataset is a collection of data from cars that contains information on the car's make, model, volume, weight, and how much CO2 it emits.

The features are as follows:
* `Car`: name of car brand
* `Model`: name of car model
* `Volume`: engine size (in cm^3)
* `Weight`: weight of car (in kg)
* `CO2`: amount of CO2 emitted (in g/km)

#### **Step #1: Load the data**

In [None]:
url = "https://raw.githubusercontent.com/the-codingschool/TRAIN/main/emissions/car_emissions.csv"

cars_df = pd.read_csv(url)
cars_df.head()

#### **Step #2: Decide independent and dependent variables**

We are going to use `Volume` and `Weight` as our independent variables for predicting `CO2` emissions.



In [None]:
features = # COMPLETE THIS CODE
labels = # COMPLETE THIS CODE

#### **Step #3: Split data into training and testing data**


In [None]:
# COMPLETE THIS CODE

#### **Step #4: Import your algorithm**


In [None]:
# COMPLETE THIS CODE

#### **Step #5: Initialize your model and set hyperparameters**



In [None]:
# COMPLETE THIS CODE

#### **Step #6: Fit, Test, and Visualize**


In [None]:
model.fit(X_train, # COMPLETE THIS CODE

In [None]:
predictions = # COMPLETE THIS CODE

In [None]:
# VISUALIZE THE TRUE VS. PREDICTED VALUES

#### **Step #7: Evaluate**

Let's evaluate this model and put it to the test! Specifically, evaluate the model using our standard regression metrics: $R^2$, MSE, and MAE.


#### **Step #8: Use the model**

Using the model we created, predict the CO2 emissions of two cars:

* **Car 1:** Volume is 800 cm^3 and weight is 1020 kg

* **Car 2:**  Volume is 1020 cm^3 and weight is 800 kg

<br>

**NOTE**: You must create a DataFrame containing the information for the new cars:

```python
new_car_data = pd.DataFrame(new_car_data_here, columns = ["Volume", "Weight"])
```
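
If you want to check your understanding of the overall workflow before applying it to `cars_df`, Steps #3 through #8 can be sketched end to end on synthetic data. Everything prefixed with `toy_` below, the columns `a` and `b`, and the chosen `test_size` and `random_state` values are assumptions made for illustration, not the settings this lab requires.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data, made up for illustration: y = 3*a - 2*b + 5 plus small noise.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({"a": rng.uniform(0, 10, 100), "b": rng.uniform(0, 10, 100)})
toy_y = 3 * toy_X["a"] - 2 * toy_X["b"] + 5 + rng.normal(0, 0.1, 100)

# Hold out 20% of the rows for testing.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)

# Fit on the training rows only, then predict on the held-out rows.
toy_model = LinearRegression()
toy_model.fit(toy_X_train, toy_y_train)
toy_predictions = toy_model.predict(toy_X_test)

toy_r2 = r2_score(toy_y_test, toy_predictions)
toy_mse = mean_squared_error(toy_y_test, toy_predictions)
toy_mae = mean_absolute_error(toy_y_test, toy_predictions)
print("R^2:", toy_r2, "MSE:", toy_mse, "MAE:", toy_mae)

# New rows must use the same column names as the training features.
toy_new_rows = pd.DataFrame([[1.0, 2.0]], columns=["a", "b"])
toy_new_prediction = toy_model.predict(toy_new_rows)
print(toy_new_prediction)
```

The `toy_` prefix keeps these variables from overwriting the `model`, `features`, and `labels` you define in the cells of this Part.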

In [None]:
# COMPLETE THIS CODE

<a name="p3"></a>

---
## **Part 3: KNN**
---
In this part, we will use a dataset collected by astronomers about different classes of stars that have been observed. Using KNN, you will predict a star's class from its size and temperature. The possible classes are:

* `0`: Red Dwarf
* `1`: Brown Dwarf
* `2`: White Dwarf
* `3`: Main Sequence
* `4`: SuperGiants
* `5`: HyperGiants

#### **Step \#1: Load the data**

Run the given code to load and view your data frame.

In [None]:
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vTCZgoegOHa49SFXYU-ZZTdCkgTp0sneU1BsEOa7vusjTXPPLcn0i3kXhX1nyqkApJHCKTkw0mWuWr4/pub?gid=753880827&single=true&output=csv'
stars_df = pd.read_csv(url)

stars_df.head()

#### **Step \#2: Decide independent and dependent variables**

Use the dataframe `stars_df` and subset your data into `inputs` and `output`.

<br>

The `inputs` will be `size` and `temperature`.

The `output` will be `class`.

In [None]:
# COMPLETE THIS CODE

#### **Step \#3: Split data into train and test data**

Let's split your data into training and testing data. Since this is a small dataset, let's just reserve 10% of the data for testing.

In [None]:
# COMPLETE THIS CODE

#### **Step \#4: Import your model**


In [None]:
# COMPLETE THIS CODE

#### **Step \#5: Initialize your model and set hyperparameters**

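Build your model with $K=7$ and store it in a variable named `knn_model`, since the visualization cell below refers to the model by that name.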

In [None]:
# COMPLETE THIS CODE

#### **Step \#6: Fit your model and make a prediction**

Train your model with the `x_train` and `y_train` training data and make predictions on `x_test`.

In [None]:
# COMPLETE THIS CODE

#### **Create a visualization**

**Run the code below to visualize the decision boundary of this KNN model.**


In [None]:
fig, ax = plt.subplots(figsize=(10,6))

xx, yy = np.meshgrid(np.arange(0, 2000, 10),
                     np.arange(1900, 40000, 100))
z = knn_model.predict(np.c_[xx.ravel(), yy.ravel()])
z = z.reshape(xx.shape)

ax.pcolormesh(xx, yy, z, alpha=0.1)

labels = ['Red Dwarf', 'Brown Dwarf', 'White Dwarf', 'Main Sequence', 'SuperGiants', 'HyperGiants']
for label, data in stars_df.groupby('class'):
  ax.scatter(data["size"], data["temperature"], label=labels[label])

ax.set_title("Decision Boundary of the KNN Classifier")
ax.set_xlabel("Star Size")
ax.set_ylabel("Star Temperature")
ax.legend()
plt.show()

#### **Step \#7: Evaluate your model**

Print the accuracy and confusion matrix for your model's performance on the test set.

<br>

**NOTE**: It's not necessary to supply a `display_labels` argument, but if you are curious, see if you can use the information provided in this Part to supply one.

In [None]:
print("Accuracy Score: ", # COMPLETE THIS LINE

In [None]:
metrics.ConfusionMatrixDisplay.from_predictions(# COMPLETE THIS LINE

#### **Step \#8: Make predictions**


Astronomers have heard of your amazing ML model for predicting star types and want you to help them categorize new stars they have observed! For each problem below, use your KNN model to classify the stars based on the data given to you.


1. `size`: 708.9, `temperature`: 12100 (`[708.9, 12100]`)

2. `size`: 0.0998, `temperature`: 3484 (`[0.0998, 3484]`)

3. `size`: 6.39, `temperature`: 34190 (`[6.39, 34190]`)

4. `size`: 0.16, `temperature`: 2799 (`[0.16, 2799]`)
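
<br>

For reference, the KNN workflow from Steps #3 through #8 can be sketched on synthetic data. The `toy_` names, the columns `f1` and `f2`, the two made-up clusters, and the `random_state` value are all assumptions for illustration; your own code should use `stars_df` and the variable names given in each step.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, made up for illustration: two well-separated clusters.
rng = np.random.default_rng(0)
toy_inputs = pd.DataFrame({
    "f1": np.concatenate([rng.normal(0, 1, 50), rng.normal(10, 1, 50)]),
    "f2": np.concatenate([rng.normal(0, 1, 50), rng.normal(10, 1, 50)]),
})
toy_output = pd.Series([0] * 50 + [1] * 50, name="group")

# Reserve 10% of the rows for testing.
toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy_inputs, toy_output, test_size=0.1, random_state=42
)

# K is set through the n_neighbors hyperparameter.
toy_knn = KNeighborsClassifier(n_neighbors=7)
toy_knn.fit(toy_x_train, toy_y_train)
toy_pred = toy_knn.predict(toy_x_test)

toy_accuracy = accuracy_score(toy_y_test, toy_pred)
toy_cm = confusion_matrix(toy_y_test, toy_pred)
print("Accuracy:", toy_accuracy)
print(toy_cm)

# Classify one new, hypothetical point; column names match the training data.
toy_new_point = pd.DataFrame([[9.5, 10.2]], columns=["f1", "f2"])
toy_new_class = toy_knn.predict(toy_new_point)
print(toy_new_class)
```

Wrapping each new observation in a one-row DataFrame with the training column names keeps the input in the same shape and order the model was fitted on.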

In [None]:
# COMPLETE THIS CODE

In [None]:
# COMPLETE THIS CODE

In [None]:
# COMPLETE THIS CODE

In [None]:
# COMPLETE THIS CODE

# End of Notebook

---

© 2024 The Coding School, All rights reserved