# Data Cleaning

## ANTICIPATED TIME:


2 hours

## BEFORE YOU BEGIN


[Crossvalidation](Crossvalidation.ipynb)

## WHAT YOU WILL LEARN


* When to clean data?
* What is missing data?
* When to transform data?
* How to create new variables?
* What is versioning?

## DEFINITIONS YOU’LL NEED TO KNOW



* Outliers - data points that are far outside of a dataset
* Common scale - different features or variables that are adjusted to have similar ranges
* Standardized variable - has a mean of zero and a standard deviation of 1
* Non-normality - when the distribution of data does not follow the bell-shaped normal distribution
* Versioning - saving different copies of your work to see how things changed

## SCENARIO:

So far, the team has collected a lot of data to solve the pollution problem in their city! Kiana notices that the data they’ve collected over time isn’t as clean as they thought it was. Some of the data has missing values or different scales, like the air quality measurements in different areas. Diego suggests that they use data cleaning techniques to fix these issues. He thinks that techniques such as transforming the data to make it easier to compare, standardizing pollution levels to ensure they are on the same scale, and even removing outliers that could cause errors would all help. There are many different techniques the team can use! The team realizes that by using the proper data cleaning techniques, they can continue to make better-informed decisions and solve their pollution issue.

## WHAT DO I NEED TO KNOW?




**Data Cleaning, Transformation, and Versioning**

Most of the datasets we’ve worked with so far either had no mistakes or very few mistakes. Sounds awesome, but unfortunately that’s not always the case! Real-world datasets often need fixing or “cleaning” before we can use them and accurately run our data science methods.
In this notebook, we’ll look at some techniques we’ve already used and add in some new ones.

**Missing Data**

Imagine we are trying to get some information on the people who live in our community. We got lots of great data like their addresses, but maybe we are missing information like their age or the school they go to. Think about all the different ways a missing value might show up in the data…
* None
* na
* nan
* Nan
* NaN
* -NaN
* -nan
* -1
* #IND
* NA
* #N/A
* N/A
* n/a
* #NA
* NULL
* Null

All of those might make sense when we look at them, but they’re confusing to a computer because it technically sees them all as different values.
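To see why this matters, here is a minimal sketch of how pandas reads a few of these markers. The tiny table and its `name` and `age` columns are made up for illustration. pandas recognizes several common markers (such as `NA`, `n/a`, and `NULL`) on its own, but a marker like `-1` has to be listed with the `na_values` option, because otherwise it looks like a real number.

```python
import io
import pandas as pd

# A made-up table where the missing ages are written in four different ways.
raw_text = "name,age\nAna,12\nBen,NA\nCai,n/a\nDee,NULL\nEli,-1\n"

# pandas treats NA, n/a, and NULL as missing by default.
# We add "-1" ourselves because it would otherwise be read as a real age.
people = pd.read_csv(io.StringIO(raw_text), na_values=["-1"])

missing_ages = people["age"].isna().sum()
print(people)
print("Missing ages:", missing_ages)
```

Once every marker is read as the same "missing" value, the rest of the cleaning steps can treat them all alike.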

It’s important to know that most algorithms can’t handle missing values, so we have to decide what to do when that happens. So what should we do?
* Get rid of them by dropping missing values, or
* Replace them with real values.

*Dropping Missing Values*

We’ve seen this method before. But sometimes we don’t want to lose data, because removing rows might bias the model. Another option is to replace missing data with a convenient value like the mean or median.

*Remove or Replace Missing Values*

The best way to decide what to do when data is missing is to understand why the data is missing in the first place. If there’s a pattern of missing data, it will likely mess up your model.
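The two options can be compared side by side. This is only a sketch on made-up readings (the `area` and `pollution` columns are hypothetical); which option is right depends on why the value is missing.

```python
import numpy as np
import pandas as pd

# Made-up air quality readings; NaN marks a reading that was never recorded.
readings = pd.DataFrame({
    "area": ["North", "South", "East", "West"],
    "pollution": [40.0, np.nan, 60.0, 50.0],
})

# Option 1: drop every row that has a missing value.
dropped = readings.dropna()

# Option 2: keep every row and fill the gap with the column's median.
median_value = readings["pollution"].median()
filled = readings.fillna({"pollution": median_value})

print(dropped)
print(filled)
```

Dropping keeps only real measurements but loses the South row entirely; filling keeps the row but puts an estimated value in it.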



**Transforming Data**

There are two common reasons to change data:

* Outliers
* Common Scale

*Outliers*

Ever look at a dataset and one value just feels… off? Like someone in the 7th grade who has an age of 26? Or maybe most people on the basketball team score about 15 points per game, but one player is recorded at 42 points per game. It’s a pretty good bet that data point is an outlier.

Why does that matter though? Outliers can have a strong effect on some models, like linear regression. When we identify outliers, it’s important to understand what caused them. For example, maybe it’s simply a measuring error that can be removed or changed. On the other hand, maybe the outlier is showing a real difference in the data that shouldn’t be ignored.

To check if an outlier is “real,” look for patterns between it and other variables.

* **Finding outliers using the standard deviation**. If you decide it’s best to drop the outliers, you can find them in a variety of ways. One way is the standard deviation, which shows how spread out a variable’s values are around the mean. In a normal distribution, most values (about 99.7%) fall within three standard deviations of the mean, so anything outside of that range can be flagged as an outlier. One problem though - the mean and standard deviation are themselves affected by outliers, so we have to be careful there.
* **Finding outliers using boxplots**.  In boxplots, outliers are circles outside of the whiskers. In normal distributions without outliers, this is similar to using three standard deviations. With outliers, this method works better because it doesn't rely on mean or standard deviation. Plus it relies on some of the data visualization we talked about earlier.
* **Finding outliers using percentiles**. Another way we can identify outliers is by using percentiles. For example, when we use percentiles, we can call everything below the 1st percentile or above the 99th percentile an outlier.
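The three approaches above can be sketched on the basketball example. The numbers are made up, and the boxplot rule shown here assumes the common convention of whiskers reaching 1.5 times the interquartile range (IQR) beyond the quartiles.

```python
import pandas as pd

# Made-up points per game for a ten-player team; 42 looks suspicious.
points = pd.Series([14, 15, 16, 15, 13, 17, 15, 14, 16, 42])

# Method 1: more than three standard deviations away from the mean.
mean, std = points.mean(), points.std()
std_outliers = points[(points - mean).abs() > 3 * std]

# Method 2: the boxplot rule, more than 1.5 * IQR beyond the quartiles.
q1, q3 = points.quantile(0.25), points.quantile(0.75)
iqr = q3 - q1
box_outliers = points[(points < q1 - 1.5 * iqr) | (points > q3 + 1.5 * iqr)]

# Method 3: below the 1st percentile or above the 99th percentile.
low, high = points.quantile(0.01), points.quantile(0.99)
pct_outliers = points[(points < low) | (points > high)]

print("Standard deviation:", std_outliers.tolist())
print("Boxplot rule:", box_outliers.tolist())
print("Percentiles:", pct_outliers.tolist())
```

On this small sample, the 42 inflates the mean and standard deviation so much that the three standard deviation check misses it, while the boxplot rule catches it. The percentile check always flags the most extreme values at each end, so its results still need a human look.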


*Common Scale*

Different variables will naturally be on different scales. For example, height might range from 4 to 7 feet, but weight might range from 80 to 400 pounds. Some models work best when all variables are on the same scale. An example is KNN, which relies on the distance between data points. Variables with larger scales contribute more to that distance and so can have an outsized influence on the model’s decisions. We’ll consider ways to put things on a common scale so we can avoid potential errors.
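One common way to reach a common scale is standardizing, which gives each variable a mean of zero and a standard deviation of 1. Here is a minimal sketch on made-up heights and weights (the column names are hypothetical).

```python
import pandas as pd

# Made-up heights (feet) and weights (pounds) on very different scales.
people = pd.DataFrame({
    "height_ft": [4.5, 5.0, 5.5, 6.0, 6.5],
    "weight_lb": [90.0, 120.0, 150.0, 200.0, 260.0],
})

# Standardize: subtract each column's mean, then divide by its standard deviation.
standardized = (people - people.mean()) / people.std()

print(standardized.round(2))
print(standardized.mean().round(2))  # each column's mean is now about 0
print(standardized.std().round(2))   # each column's standard deviation is now 1
```

After this step, a one-unit difference means "one standard deviation" for both height and weight, so neither dominates a distance calculation.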

Let’s jump in!


## YOUR TURN

### Goal 1: Importing the Pandas Library

Need extra tools to help solve this problem? We can bring in extra ‘libraries’ to help us do more data science work. You can think of a library as an ‘add-on’. In this case, we bring in pandas, which is a popular library for working with data.


#### Blockly


**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the “command” is “import”, which brings the add-on package in.

Open the IMPORT menu, which is where we bring in other data tools, and drag out the **import** block.




**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **pandas**, which will bring in some cool data manipulation features.






**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the **import** and **package** together in a single variable. This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve put everything into **pd**, and we type it in the open area.

**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!


<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GZYmOOpW8AAf7uc?format=png&name=small)
</details>

In [None]:
# blockly code


#### Freehand


**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the “command” is “import”, which brings the add-on package in.

**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **pandas**, which will bring in some cool data manipulation features.

**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the ‘import’ and ‘package’ together in a single variable. This handy feature helps cut down on all the typing later on. Feel free to use whatever name you want that will help you remember it later on. In the example below, we’ve put everything into **pd**.
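In Python, the three steps above come together in a single line. A minimal sketch (the short name `pd` is a widely used convention, but any name would work):

```python
# Bring in the pandas library and give it the short name "pd".
import pandas as pd

# Quick check that the import worked: show which version was loaded.
print(pd.__version__)
```

From here on, anything from pandas is reached by typing `pd.` first, such as `pd.read_csv`.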

**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GZmkVCYWEA4oGso?format=jpg&name=small)
</details>




**Your Turn**: Now it’s your turn! We’re going to dive into the pandas package, which helps us with some really cool data science things. First, let’s import the package and assign it to the variable “pd” to make it easier to use throughout our notebook.

In [None]:
# freehand code


**Explanation**: *Congrats! You did it! You have successfully imported the "pandas" package as the variable "pd"*.

### Goal 2: Bringing in the Dataframe

Let’s bring in the data that we want to look at.

#### Blockly




**Step 1 - Write out the variable name you want to use**

Now that we’re all set with our new package to help us to do cool things, let’s bring the data into a variable and call it **data**. Think of it as a digital spreadsheet with much more power to analyze and manipulate the data!

In Blockly, bring in the VARIABLES menu.


**Step 2 - Assign the dataframe to the variable you created**

Just like we did before, let’s type out a variable name. Rather than type out the full file name for our data, this easy to remember name will hold the data we bring in.

In Blockly, go to the Variables and drag the Set block for the **data** variable. This will allow us to assign the result of a function call to the variable. A function is basically code that does a specific task for us.

**Step 3 - Bring in the data**

Now we need to look at the file that has all our data. To load our dataframe, we’ll use a simple command to bring in the file we need (a CSV, or Comma Separated Values, file). Let’s say we have a file called ‘datasets/AirQualityTrain.csv’ in the folder **‘datasets’**. We’re telling Python to read the CSV file and store it in a variable called **data**.

From the Variable menu, drag a DO block using the **pd** variable, and select the Do operation **read_csv**. The read_csv function reads a CSV file and returns a DataFrame object.

In our case, let’s bring in the “datasets/AirQualityClean.csv" (use the Quotes from the TEXT menu) because that is what Kiana is working with.

**Step 4 - Display the variable**

Let’s see it now by ‘displaying’ and showing our work.

Drag the **data** variable to the workspace, making it available for further use in our program. This step is more of a visualization step, as it allows us to see the variable in the Blockly workspace

**Step 5 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GnKDKLtakAArIV9?format=jpg&name=small)
</details>

In [None]:
# blockly code


#### Freehand


**Step 1 - Write out the variable name you want to use**

Now that we’re all set with our new package to help us do cool things, let’s bring the data into a variable called **data**. Think of it as a digital spreadsheet with much more power to analyze and manipulate the data!

**Step 2 - Assign the dataframe to the variable you created**

Just like we did before, let’s type out a variable name. Rather than type out the full file name for our data, this easy to remember name will hold the data we bring in.

**Step 3 - Bring in the data**

Now we need to look at the file that has all our data.

To load our dataframe, we’ll use a simple command to bring in the file we need (a CSV, or Comma Separated Values, file). Let’s say we have a file called “datasets/AirQualityTrain.csv” in the folder **‘datasets’**. We’re telling Python to read the CSV file and store it in a variable called **data**. To do this, we write “**pd.read_csv**”, which makes the code read the CSV file. This variable is now our dataframe!

In our case, let’s bring in “datasets/AirQualityClean.csv” (written in quotes) because that is what we are working with.
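Here is a small sketch of reading a CSV file into a dataframe. Because the lesson's file may not exist on every machine, the sketch first writes a tiny stand-in file; the file name `AirQualityExample.csv` and the values in it are made up for illustration.

```python
import os
import tempfile
import pandas as pd

# Stand-in for the lesson's file: write a tiny CSV so this sketch runs anywhere.
# The file name and the values here are made up for illustration.
folder = tempfile.mkdtemp()
file_path = os.path.join(folder, "AirQualityExample.csv")
with open(file_path, "w") as f:
    f.write("Report_Date,Annual_Production\n15/01/2020,1200\n03/02/2020,no production\n")

# Read the CSV file and store the resulting dataframe in "data".
data = pd.read_csv(file_path)
print(data)
```

In the notebook itself, the call would use the lesson's path instead: `pd.read_csv("datasets/AirQualityClean.csv")`.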

**Step 4 - Print the variable**

Let’s see it now by ‘printing’ and showing our work.

**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!


<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GnKE_cFa8AEEiDu?format=png&name=small)
</details>

**Your Turn**: Now it’s your turn!  Let’s dive in and start working with the data! We’ll begin by loading it into a dataframe, which will allow us to easily interact with and analyze the dataset.

In [None]:
# freehand code


**Explanation**:  *Easy-peasy! You have now brought in the dataframe and stored it as a variable that you can reference later on. Now, onto the fun part*!

### Goal 3: Showing information about the dataframe

The df.info() method provides a concise summary of your DataFrame, including the number of rows and columns, data types, and missing values. It's a valuable tool for understanding your data's structure and identifying potential issues before analysis.  


#### Blockly



**Step 1 - Getting the info from the dataframe**

To get more information, we first need to tell the computer what dataframe we want to look at - **data**.

From the "VARIABLES" menu, drag a DO block WITH **data** variable. Select the **info** method from the list.   

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GnKGUMQaAAA98SW?format=png&name=small)
</details>



In [None]:
# blockly code


#### Freehand

**Step 1 - Getting the info from the dataframe**

To get more information, we first need to tell the computer what dataframe we want to look at - **data**.

Let’s get info on each of the columns of the dataframe, **data**, and review each column’s data type and number of records.
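As a sketch of what this looks like in code, the example below calls `info()` on a small made-up dataframe that stands in for the lesson's data. The optional `buf` argument is only used here to capture the summary as text.

```python
import io
import pandas as pd

# A small made-up dataframe standing in for the lesson's data.
data = pd.DataFrame({
    "Report_Date": ["15/01/2020", "03/02/2020", "27/03/2020"],
    "Annual_Production": ["1200", "no production", None],
})

# info() prints each column's name, non-null count, and data type.
data.info()

# Optional: capture the same summary as text by passing a buffer.
buffer = io.StringIO()
data.info(buf=buffer)
summary = buffer.getvalue()
```

The non-null count is the quickest clue that a column has missing values: here Annual_Production has fewer non-null entries than there are rows.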

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GnKHRJNasAEyKir?format=png&name=small)
</details>


**Your Turn**: Your turn! Give the info() method a try and see what it tells you about your data.

In [None]:
# freehand code


**Explanation**: *You have printed a concise summary of the DataFrame, including the data types of each column, the number of non-null values, and memory usage*.

### Goal 4: Changing the Data Type of a Column

We want to look a bit deeper into the data, but sometimes that’s difficult because it’s in the wrong form. Let’s change the data format so that it’s easier to work with.

#### Blockly





**Step 1 - Call the insert() method of the dataframe**

To keep the converted dates alongside the original data, we will add them to our dataframe as a new column. To do this, we use the insert() method on the **data** dataframe.

From the Variable menu, get a DO block for the **data** variable. With it, select the insert operation. The **insert** function will add a new column to the dataframe.

**Step 2 - Add parameters into the insert() method**

We can add the new column wherever we want, so let’s add it to the first position, column 0.

From the Math menu, get a number block and change it to **“0” (zero)**. Connect it as the first parameter of the insert operation.

**Step 3 - Define the name of the new column**

We can choose whatever name we want for the new column, so let’s call it **New_Report_Date**.

From the Text menu, get a Quote block “” and type,  **New_Report_Date**. Connect it as the second parameter of the insert operation.

**Step 4 - Call the date conversion function**

Let’s use a date conversion function from pandas to create a new date column using the proper date type.

From the Variable menu, drag a DO block for the pd variable and select the **to_datetime** function. Connect that as the third parameter of the insert function.

**Step 5 - Specify the original data column**

Select the date column to be formatted from the dataframe, **data**.

From the List menu, drag a dictVariable block, and select the **data** variable. Also get a Quote block from the Text menu and enter the column name, **Report_Date**. Connect this block to the first parameter for the to_datetime function.

**Step 6 - Specify the correct format of the date**

Because the original column stores the dates as strings, we have to tell pandas which date format they use. In our example, it is the European format, day/month/year.

Get a freehand block and type **format=**. Connect that with a Quote block with **“%d/%m/%Y”**. Connect these as the second parameter of the to_datetime function.

**Step 7 - See now the new date column on the dataframe**

Let’s print the info on the columns of the dataframe again and see the newly created date column.

From the Variable menu, drag a DO block for the **data** variable, then select the **info**() method from the list.  





**Step 8 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GnKH3RGbIAADssk?format=png&name=small)
</details>


In [None]:
# blockly code


#### Freehand

**Step 1 - Inserting the new formatted column**

Let’s create a new column with the formatted date information.
With the **data** dataframe, call the **insert**() method.

`data.insert()`


**Step 2 - Specify the position of the new column**

We can add the new column wherever we want, so let’s add it to the first position, column 0.

Just type **0 (zero)**, as the position of the column. Keep in mind that the first position in Python is position zero.

`data.insert(0)`

**Step 3 - Define the name of the new column**

We can choose whatever name we want for the new column, so let’s call it **New_Report_Date**.

Give the new column name as a string, ‘New_Report_Date’.

`data.insert(0,'New_Report_Date')`

**Step 4 - Call the date conversion function**

Let’s use a date conversion function from pandas to create a new date column using the proper date type.

From the pandas library, pd, call the function to_datetime().

`data.insert(0,'New_Report_Date',pd.to_datetime( ))`

**Step 5 - Specify the original data column**

Select the date column to be formatted from the dataframe, **data**.
Using the **data** dataframe, select the original date column, **Report_Date**.

`data.insert(0,'New_Report_Date',pd.to_datetime(data['Report_Date']))`


**Step 6 - Specify the correct format of the date**

Because the original column stores the dates as strings, we have to tell pandas which date format they use. In our example, it is the European format, day/month/year, so we pass **“%d/%m/%Y”**. Note that insert() changes the dataframe directly, so we don’t assign its result back to **data**.

`data.insert(0,'New_Report_Date',pd.to_datetime(data['Report_Date'],format= '%d/%m/%Y'))`
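Putting the steps together, here is a minimal runnable sketch on made-up dates. One detail worth knowing: `insert()` changes the dataframe in place and returns `None`, so its result should not be assigned back to `data`.

```python
import pandas as pd

# Made-up dates written as day/month/year text.
data = pd.DataFrame({"Report_Date": ["15/01/2020", "03/02/2020", "27/03/2020"]})

# Convert the text to real dates, telling pandas the day comes first.
new_dates = pd.to_datetime(data["Report_Date"], format="%d/%m/%Y")

# insert() changes "data" in place and returns None,
# so we do not assign its result back to "data".
data.insert(0, "New_Report_Date", new_dates)

data.info()
print(data["New_Report_Date"].dt.month.tolist())  # months: [1, 2, 3]
```

Without the `format` argument, a date like 03/02/2020 could be read as March 2nd instead of February 3rd, which is why naming the format matters.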

**Step 7 - See now the new date column on the dataframe**

Let’s print the info on the columns of the dataframe again and see the newly created date column. With the **data** dataframe, call the **info**() method.


<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GnKJJ87aIAA-2zA?format=png&name=small)
</details>

**Your Turn**: Let's do it.

In [1]:
# freehand code


**Explanation**: *You have added a 'New_Report_Date' column (as datetime objects converted from 'Report_Date' with the European format, day/month/year, '%d/%m/%Y') at the beginning of the DataFrame, and data.info() then displays a concise summary of the updated DataFrame, including the new column's information*.

### Goal 5: Finding the Errors in our Data.

When we work with our data, it’s really important that it is in the right format. One quick way to check is by filtering for things like typos and outliers in the data.

#### Blockly

**Step 1 - Filter rows from the dataframe**

Our goal is to work with numbers in our dataset, but sometimes we might have words (strings). Let’s filter so that we can find them easily. Let’s get started by saying what dataframe we want to look at, which will be **data**.

From the List menu, drag a dictVariable and select **data** (dataframe).

**Step 2 - Define the base column to perform the filtering**

Now let’s tell it what column (Annual_Production) we want to dig into.
From the List menu, drag a dictVariable and select **data** (dataframe). Connect to that a Quote block with the column name, **Annual_Production**.

**Step 3 - Define the criteria to perform the filtering**

Let’s wrap it up by saying what information we are looking for in the column. In this case, we want to look at everything labeled “no production”.

From the Logical menu, get an equals block (=), and on the right part of it, connect a Quote block with the criteria name “no production”. On the left part, connect the block with the column definition, data[“Annual_Production”].

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GnKK6c_bEAATabF?format=png&name=small)
</details>


In [None]:
# blockly code


#### Freehand


**Step 1 - Filter rows from the dataframe**

Our goal is to work with numbers in our dataset, but sometimes we might have words (strings). Let’s filter so that we can find them easily. Let’s get started by saying what dataframe we want to look at, which will be **data**.

We have to identify the rows that don’t have the expected information and production numbers but a text record of “no production”. To do that, let’s filter the rows that have this text.
To filter rows, use the brackets [] on the **data** dataframe.

`data[  ]`

**Step 2 - Define the base column to perform the filtering**

Now let’s tell it what column (Annual_Production) we want to dig into.

Let’s filter based on the **Annual_Production** column, checking which rows have this column as “**no production**”.
The brackets [] are used to select columns, in our case, Annual_Production, in addition to filtering rows.

`data[ data['Annual_Production'] ]`

**Step 3 - Define the criteria to perform the filtering**

Let’s wrap it up by saying what information we are looking for in the column. In this case, we want to look at everything labeled “no production”.

Specify the value to look for in the column, **"no production"**.
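The finished filter can be sketched like this. The dataframe is made up, and the `Facility` column is a hypothetical addition that makes it easier to see which rows come back.

```python
import pandas as pd

# Made-up data; the "Facility" column is hypothetical.
data = pd.DataFrame({
    "Facility": ["A", "B", "C", "D"],
    "Annual_Production": ["1200", "no production", "950", "no production"],
})

# The comparison gives True or False for each row...
is_text_row = data["Annual_Production"] == "no production"

# ...and putting it inside data[ ] keeps only the rows marked True.
text_rows = data[is_text_row]
print(text_rows)
```

Note that this only shows the matching rows; `data` itself is unchanged until a filtered result is assigned back to it.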


<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GnKMCq_awAAvrj4?format=png&name=small)
</details>

**Your Turn**: Let’s find the rows that say "no production" so we know which ones to remove before working with just the numbers in our data!




In [None]:
# freehand code



**Explanation**:  *You filtered the DataFrame data to show the rows where the Annual_Production column has the value "no production". Finding these rows is the first step toward cleaning the data so the Annual_Production column only includes numbers we can work with*!

### Goal 6: Filter out invalid rows.

Now that we’ve found the rows we don’t want, let’s get rid of them.

#### Blockly


**Step 1 - Filter rows from the dataframe**

Our goal is to work with numbers in our dataset, but sometimes we might have words (strings). Let’s filter so that we can find them easily. Let’s get started by saying what dataframe we want to look at, which will be **data**.

From the List menu, drag a dictVariable and select **data** (dataframe).

**Step 2 - Define the base column to perform the filtering**

Now let’s tell it what column (Annual_Production) we want to dig into.
From the List menu, drag a dictVariable and select **data** (dataframe). Connect to that a Quote block with the column name, **Annual_Production**.

**Step 3 - Define the criteria to perform the filtering**

Because we are only interested in the numbers, let’s get rid of anything that says “**no production**”.

Specify the value used to filter the rows, “no production”.
From the Logical menu, get a **not equals block (≠)**, and on the right part of it, connect a Quote block with the criteria name “no production”. On the left part, connect the block with the column definition, data[“Annual_Production”].

**Step 4 - Re-assign the filter dataframe to the data variable**

Now we finally have only the numbers and have gotten rid of the data we are not interested in (“no production”). So let’s overwrite **data** now that we have the valid numbers.

**Step 5 - Display the variable**

Let’s display the variable we just created to see what we get.

From the Variables menu, drag the **data** variable to a new workspace to print the cleaned dataframe for further analysis.

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GnKNUEdbAAAe18k?format=png&name=small)
</details>


In [None]:
# blockly code


#### Freehand


**Step 1 - Filter rows from the dataframe**

Our goal is to work with numbers in our dataset, but sometimes we might have words (strings). Let’s filter so that we can find them easily.

We have to identify the rows that don’t have the expected information and production numbers but a text record of “no production”. To do that, let’s filter the rows that have this text.

To filter rows, use the brackets [] on the **data** dataframe.

`data[  ]`


**Step 2 - Define the base column to perform the filtering**

Now let’s tell it what column (Annual_Production) we want to dig into.

Let’s filter based on the **Annual_Production** column, checking which rows have this column as “**no production**”.

The brackets [] are used to select columns, in our case, Annual_Production, in addition to filtering rows.

`data[ data['Annual_Production'] ]`

**Step 3 - Define the criteria to perform the filtering**

Because we are only interested in the numbers, let’s get rid of anything that says “**no production**”.

Specify the value used to filter the rows, “no production”.

`data[(data['Annual_Production'] != 'no production')]`

**Step 4 - Re-assign the filter dataframe to the data variable**

Now we finally have only the numbers and have gotten rid of the data we are not interested in (“no production”).

Once we have the filtered dataframe, let’s re-assign it to the **data** variable.

`data = data[(data['Annual_Production'] != 'no production')]`
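Here is the same idea as a runnable sketch on made-up data (the `Facility` column is hypothetical). It also shows why one more step is still needed afterwards: the remaining values look like numbers but are still stored as text.

```python
import pandas as pd

# Made-up data; the "Facility" column is hypothetical.
data = pd.DataFrame({
    "Facility": ["A", "B", "C", "D"],
    "Annual_Production": ["1200", "no production", "950", "no production"],
})

# Keep only the rows where the column is NOT "no production",
# then store the filtered dataframe back in "data".
data = data[data["Annual_Production"] != "no production"]
print(data)

# The values are still text at this point, even though they look like numbers.
print(type(data["Annual_Production"].iloc[0]))
```

The filtered dataframe keeps its original row labels (0 and 2 here), which is normal after removing rows.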


**Step 5 - Display the variable**

Let’s display the variable we just created to see what we get.

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GnKOCAdbgAAd8VQ?format=png&name=small)
</details>

**Your Turn**: Let’s clean up the data! Try filtering the “data” dataframe so it only keeps the rows where Annual_Production is not “no production”. Then, update the “data” variable and display it to see your cleaned-up results!






In [None]:
# freehand code


**Explanation**: *We are keeping only the rows where the value in the 'Annual_Production' column is not equal to the string 'no production', and then reassigning this filtered DataFrame back to the variable data*.

### Goal 7: Convert Numeric Column

Before, we converted a date to make it easier to work with our data. Now let’s do the same for the Annual_Production column. The problem is that the system still sees it as a mix of numbers and text. Let’s tell the system that we only have numbers now in the column.

#### Blockly




**Step 1 - Inserting the new formatted column**

Let’s create a new column with the values converted to numbers.
From the Variables menu, drag a DO block for the **data** variable. Then select the method **insert**.

**Step 2 - Specify the position of the new column**

We can add the new column wherever we want, so let’s add it to the first position, column 0.
From the Math menu, get a number block and change it to **“0” (zero)**. Connect it as the first parameter of the insert operation.

**Step 3 - Define the name of the new column**

We can choose whatever name we want for the new column, so let’s call it **New_Annual_Production**.

From the Text menu, get a Quote block “” and type,  **New_Annual_Production**. Connect it as the second parameter of the insert operation.

**Step 4 - Call the numeric conversion function**

Let’s use a numeric conversion function from pandas to create a new numeric column using the proper number type.
From the Variable menu, drag a DO block for the pd variable and select the **to_numeric** function. Connect that as the third parameter of the insert function.

**Step 5 - Specify the original data column**

Select the column to be converted from the dataframe, data.
From the List menu, drag a dictVariable block and select the **data** variable. Also, get a Quote block from the Text menu and enter the column name, **Annual_Production**. Connect this block to the first parameter for the to_numeric function.

**Step 6 - See now the new numeric column on the dataframe**

Let’s print the info on the columns of the dataframe again and see the newly created numeric column.

From the Variable menu, drag a DO block for the **data** variable, then select the info() method from the list.

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GnKOr94agAECEPj?format=png&name=small)
</details>


In [None]:
# blockly code


#### Freehand


**Step 1 - Inserting the new formatted column**

Let’s create a new column with the values converted to numbers.
With the data dataframe, call the **insert**() method.

`data.insert()`

**Step 2 - Specify the position of the new column**

We can add the new column wherever we want, so let’s add it to the first position, column 0.
Just type **0 (zero)**, as the position of the column. Keep in mind that the first position in Python is position zero.

`data.insert(0)`


**Step 3 - Define the name of the new column**

We can choose whatever name we want for the new column, so let’s call it **New_Annual_Production**.
Give the new column name as a string, ‘New_Annual_Production’.

`data.insert(0,'New_Annual_Production' )`

**Step 4 - Call the numeric conversion function**

Let’s use a numeric conversion function from pandas to create a new numeric column using the proper number type.
From the pandas library, **pd**, call the function **to_numeric**().

`data.insert(0,'New_Annual_Production',pd.to_numeric())`

**Step 5 - Specify the original data column**

Select the column to be converted from the dataframe, **data**.
Using the **data** dataframe, select the original column, **Annual_Production**.

`data.insert(0,'New_Annual_Production',pd.to_numeric(data['Annual_Production']))`
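A minimal runnable sketch of this conversion, using made-up production values stored as text:

```python
import pandas as pd

# Made-up production values that are stored as text.
data = pd.DataFrame({"Annual_Production": ["1200", "950", "1430.5"]})

# to_numeric turns number-like text into real numbers.
numbers = pd.to_numeric(data["Annual_Production"])

# Add the converted values as the first column (position 0).
data.insert(0, "New_Annual_Production", numbers)

data.info()

# Math now works on the new column, which it would not on the text column.
print(data["New_Annual_Production"].mean())
```

If some text cannot be converted, `pd.to_numeric` raises an error by default. Passing `errors="coerce"` turns those values into NaN instead, which is an alternative to filtering the text rows out first.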

**Step 6 - See now the new numeric column on the dataframe**

Let’s print the info on the dataframe's columns again and see the newly created numeric column.
With the data dataframe, call the **info**() method.

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GnKPuTCaIAAy9Fd?format=png&name=small)
</details>

**Your Turn**: Add a new column called New_Annual_Production at the start of the dataframe using pd.to_numeric on Annual_Production. Then use data.info() to check that it worked!


In [None]:
# freehand code


**Explanation**: *You have converted the Annual_Production values to a floating-point number format (float64) and stored them in the new New_Annual_Production column. This is necessary for performing numerical calculations and statistical analysis on the production data. By converting the column to a numeric type, you can calculate averages, standard deviations, and other statistical measures*.

### Goal 8: Removing null rows. 

You might remember when we worked with null data in the notebook Data Science and the Nature of Data. Null values can really throw off our analysis, so let’s get rid of them.

#### Blockly


**Step 1 - Drop all the null rows**

For rows that still have null values, let’s remove all of those by bringing in the **dropna** method. This will get rid of everything for us!
From the Variable menu, drag a Do block. With that select the method, **dropna**.   



**Step 2 - Re-assign the cleaned dataset back to the variable**

Now that the null rows are gone, let’s save the cleaned dataset to the same variable, **data**. Now everything is nice and clean with no missing values.

From the Variable menu, drag a Set block of the **data** variable. Connect the dropna block to it.

**Step 3 - Print the cleaned dataset**

Let’s now print the cleaned dataset on the screen.

From the Variable menu, drag a **data** variable block.   

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GnKRzxCawAAkuXg?format=png&name=small)
</details>


In [None]:
# blockly code





#### Freehand



**Step 1 - Drop all the null rows**

For rows that still have null values (NA), let’s remove them with the **dropna** method. It drops every row that contains at least one missing value.

With the **data** dataframe, call the **dropna**() method.

`data.dropna()`

**Step 2 - Re-assign the cleaned dataset back to the variable**

The **dropna** method returns a cleaned copy and leaves the original untouched, so let’s save the cleaned dataset to the same variable, **data**. Now everything is nice and clean with no missing values.

Type **data =** in front of the dropna call.

`data = data.dropna()`

**Step 3 - Print the cleaned dataset**

Let’s now print the cleaned dataset on the screen.

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GnKSg6gagAAww5o?format=png&name=small)
</details>

**Your Turn**: Use dropna() to remove rows with missing values, reassign it back to data, and then print data to see your cleaned-up dataset!

In [None]:
# freehand code


**Explanation**: *You have removed every row of a Pandas DataFrame that contains any missing values. Missing values are typically represented as NaN (Not a Number)*.
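Here is a small sketch of the same idea. The dataframe `sample` and its values are made up for illustration; only the column names are borrowed from the lesson.

```python
import pandas as pd

# Hypothetical sample data with two missing values
sample = pd.DataFrame({
    "Facility_Age": [12, float("nan"), 30],
    "Region_Type": ["Rural", "Urban", None],
})

rows_before = len(sample)

# dropna() returns a new dataframe, so we assign it back to keep the result
sample = sample.dropna()

rows_after = len(sample)
print(rows_before, rows_after)
```

Only the first row has no missing values, so it is the only one that survives. Keep in mind that dropping rows also throws away the good values in those rows, so it is worth checking how many rows you lose.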

### Goal 9: Save a column in a variable.

Sometimes in data science we want to pull out a single column so that it’s easier to work with. Let’s do that by saving a column in a separate variable named Facility_Age.

#### Blockly

**Step 1 - Tell the system what column you want to work with**

Let’s select the column that we want to analyze more closely, Facility_Age.

From the List menu, drag a dictVariable, and select the **data** variable. From the Text menu, get a Quote block, and type **Facility_Age**.

**Step 2 - Assigning this column to a new variable**

Now that we have selected the column, let’s save it to a new variable that we can reference later.

In the variable menu create a new variable, **Facility_Age**. Drag the Set block of it, and connect it with the data[“Facility_Age”] block.

**Step 3 - Print the variable**

Let’s now print the new variable on the screen.
From the Variable menu, drag the variable **Facility_Age**.

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GnKiY6dboAAjQ8k?format=jpg&name=small)
</details>



In [None]:
# blockly code


#### Freehand

**Step 1 - Tell the system what column you want to work with**

Let’s select the column that we want to analyze more closely, Facility_Age.

With the **data** dataframe, type the column name, **Facility_Age**, in quotes inside square brackets.

`data['Facility_Age']`

**Step 2 - Assigning this column to a new variable**

Now that we have selected the column, let’s save it to a new variable that we can reference later.

Type the new variable name, **Facility_Age**, followed by an equals sign, in front of the selection.

`Facility_Age = data['Facility_Age']`

**Step 3 - Print the variable**

Let’s now print the new variable on the screen.

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GnKjG1gbAAAGA3V?format=png&name=small)
</details>

**Your Turn**: Give it a go! Select the Facility_Age column, assign it to a new variable, and then print that variable to see the values!

In [None]:
# freehand code


**Explanation**: *You have extracted the data from the 'Facility_Age' column of the DataFrame data and saved it in a new variable. Then, the second line, simply Facility_Age, displays the contents of this newly created Series, showing the age of each individual facility*.

### Goal 10: Describe the Boxplot

Now that we’ve saved the column in its own variable, let’s create a data visualization to help us see things like the median, outliers, and quartiles.

#### Blockly




**Step 1 - Saying what data to use for the boxplot**

To make a plot, we need to choose the data we want to plot. In this case, our data is stored in the variable **Facility_Age**.

From the Variables menu, drag a Do block for the **Facility_Age** variable. With that, select the method **plot**.

**Step 2 - Calling the plot function and telling it to do a boxplot**

We’ll need to tell our system to create a plot. As you can guess, there are lots of different kinds, so we’ll need to tell the plot function that we are specifically interested in a boxplot.
Get a freehand block and type **kind=**. Connect to it a Quote block (from the Text menu) and type **box** (boxplot).

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GnKvc52aUAADj51?format=png&name=small)
</details>


In [None]:
# blockly code


#### Freehand

**Step 1 - Saying what data to use for the boxplot**

To make a plot, we need to choose the data we want to plot. In this case, our data is stored in the variable **Facility_Age**.

With the **Facility_Age** variable, call the **plot**() method.

`Facility_Age.plot()`

**Step 2 - Calling the plot function and telling it to do a boxplot**

We’ll need to tell our system to create a plot. As you can guess, there are lots of different kinds, so we’ll need to tell the plot function that we are specifically interested in a boxplot. Inside the parentheses, type **kind='box'**.

`Facility_Age.plot(kind='box')`


<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GnKwFUnbgAAl6ej?format=png&name=small)
</details>


**Your Turn**: Let’s dive in! Call the plot method on Facility_Age and set kind to 'box' to draw your boxplot.

In [None]:
# freehand code


**Explanation**: *In the boxplot, the box represents the middle 50% of the data, from Q1 to Q3. The line inside the box is the median. The whiskers extend from the box toward the minimum and maximum values, but often they are limited to a certain range to exclude outliers. Outliers are data points that are significantly different from the rest of the data and are often plotted as individual points beyond the whiskers*.
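To make the parts of a boxplot concrete, the sketch below works out the numbers behind one by hand. The ages are made up for illustration, and the whisker rule used here (1.5 times the interquartile range) is a common convention that we are assuming, not something stated in this lesson.

```python
import pandas as pd

# Hypothetical facility ages, with one unusually large value
ages = pd.Series([3, 5, 8, 10, 12, 15, 18, 20, 60], name="Facility_Age")

q1 = ages.quantile(0.25)   # bottom edge of the box
median = ages.median()     # line inside the box
q3 = ages.quantile(0.75)   # top edge of the box
iqr = q3 - q1              # height of the box (interquartile range)

# Assumed convention: points farther than 1.5 * IQR from the box are outliers
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = ages[(ages < lower_limit) | (ages > upper_limit)]

print(q1, median, q3)
print(outliers.tolist())
```

With these values the box runs from 8 to 18, the median is 12, and the single value 60 falls beyond the upper limit, so it would be drawn as an individual point.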

### Goal 11: Finding the Spread

So we’ve got some insights from the boxplots, but how spread out is our data? Let’s calculate the quartiles of a dataset to understand the distribution of the data. This involves selecting the numeric data from the dataframe and applying the describe() method.

**Calculate Quartiles**: Let’s find the value that 25% of our data falls below (the first quartile) and the value that 75% of our data falls below (the third quartile).

#### Blockly


**Step 1 - Saying what data to look at for the quartiles calculation**

To describe the data, we first need to select the numeric data to explore. In this case, our data is the **Facility_Age** column we saved earlier.

From the Variables menu, drag a Do block for the **Facility_Age** variable.

**Step 2 - Call the describe method for the column**

In the next step, we apply the **describe**() method to calculate summary statistics for the column, including the mean and the quartiles.

**Step 3 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GnKwe9EbsAAanZ2?format=png&name=small)
</details>

In [None]:
# blockly code


#### Freehand


**Step 1 - Saying what data to look at for the quartiles calculation**

To describe the data, we first need to select the numeric data to explore. In this case, our data is the **Facility_Age** column we saved earlier.

`Facility_Age`


**Step 2 - Call the describe method for the column**

In the next step, we apply the **describe**() method to calculate summary statistics for the column, including the mean and the quartiles.

`Facility_Age.describe()`

**Step 3 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GnKw3Edb0AAAf_5?format=png&name=small)
</details>

**Your Turn**: Give the describe() method a try and see what it tells you about your data.

In [None]:
# freehand code


**Explanation**: *Quartiles divide a dataset into four equal parts. There are three main quartiles: First Quartile (Q1): 25% of the data falls below this value; Second Quartile (Q2): Also known as the median, 50% of the data falls below this value; Third Quartile (Q3): 75% of the data falls below this value*.
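As a quick illustration, here is a sketch of what describe() returns. The five ages are made up so that the quartiles are easy to check by eye.

```python
import pandas as pd

# Hypothetical facility ages, chosen so the quartiles are easy to read
ages = pd.Series([2, 4, 6, 8, 10], name="Facility_Age")

summary = ages.describe()
print(summary)

# The quartiles are stored under the labels "25%", "50%" and "75%"
print(summary["25%"], summary["50%"], summary["75%"])
```

The result is itself a Series, so you can pick out a single statistic by its label, as the last line does.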

### Goal 12: Let’s remove the outliers.

You might remember the idea of outliers from Data Science and the Nature of Data. Basically, they are extreme data points that can throw off our analysis. Let’s use the clip method to limit the outliers, and create a new column with the limited values.

#### Blockly




**Step 1 - Inserting the new formatted column**

Let’s create a new column that will hold the facility ages with the outliers limited.
From the Variables menu, drag a DO block for the **data** variable. Then select the method **insert**.  

**Step 2 - Inform the position of the new column**

We can add the new column wherever we want, so let’s add it to the first position, column 0.
From the Math menu, get a number block and change it to **“0” (zero)**. Connect it as the first parameter of the insert operation.

**Step 3 - Define the name of the new column**

We can choose whatever name we want for the new column, so let’s call it **New_Facility_Age**.

From the Text menu, get a Quote block “” and type,  **New_Facility_Age**. Connect it as the second parameter of the insert operation.

**Step 4 - Call the clip method**

Let’s use the **clip** method to limit the values: anything lower than 0 becomes 0, and anything higher than 50 becomes 50.

From the Variables menu, drag a Do block for the Facility_Age variable. With that, select the **clip** method. From the Math menu, drag two number blocks and set them to 0 and 50 respectively. Connect each as a parameter of the **clip** method, and connect the whole block as the third parameter of the insert function.

**Step 5 - Print the data dataframe**

Let’s print the dataframe again and see the newly created column.

From the Variable menu, drag a DO block for the **data** variable.

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GnKxYaDbIAAlLf9?format=png&name=small)
</details>


In [None]:
# blockly code


#### Freehand

**Step 1 - Inserting the new formatted column**

Let’s create a new column that will hold the facility ages with the outliers limited.
With the **data** dataframe, call the **insert**() method.

`data.insert()`

**Step 2 - Inform the position of the new column**

We can add the new column wherever we want, so let’s add it to the first position, column 0.

Just type **0 (zero)**, as the position of the column. Keep in mind that the first position in Python is position zero.

`data.insert(0)`

**Step 3 - Define the name of the new column**

We can choose whatever name we want for the new column, so let’s call it **New_Facility_Age**.

`data.insert(0,'New_Facility_Age' )`

**Step 4 - Call the clip method**

Let’s use the **clip** method to limit the values: anything lower than 0 becomes 0, and anything higher than 50 becomes 50.

`data.insert(0,'New_Facility_Age',Facility_Age.clip(0,50))`

**Step 5 - Print the data dataframe**

Let’s print the dataframe again and see the newly created column.

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GnKyv_AWMAAAKCX?format=png&name=small)
</details>


**Your Turn**: Now you try it out! Add a new column called New_Facility_Age to your dataframe, use the clip method to keep values between 0 and 50, and then print the updated dataframe to see your changes!


In [None]:
# freehand code



**Explanation**: *You have used the clip() method to limit the values of Facility_Age to the range 0 to 50. Values below 0 are replaced with 0 and values above 50 are replaced with 50, so extreme values no longer distort the analysis. Note that clip() does not delete any rows; it caps the values at the bounds you give it*.
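The sketch below shows what clip does to a handful of made-up ages. The dataframe `sample` and its values are hypothetical; the bounds 0 and 50 are the ones used in the steps above.

```python
import pandas as pd

# Hypothetical ages, including one value that is too low and one that is too high
sample = pd.DataFrame({"Facility_Age": [-3, 7, 25, 48, 120]})

# Values below 0 become 0, values above 50 become 50, the rest are unchanged
sample.insert(0, "New_Facility_Age", sample["Facility_Age"].clip(0, 50))

print(sample)
```

All five rows are still there. Only the two extreme values changed, which is the difference between clipping outliers and dropping them.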


### Goal 13: Taking a String Variable and Making it 0 (False) or 1 (True).

Sometimes we have string variables that actually represent different categories. If we transform these into categorical variables, we can do some interesting analysis. In our dataset, we have categories like ‘Rural’ and ‘Urban’. Now let’s check whether each value in a column is ‘Rural’ and then code it as 0 (False) or 1 (True).

#### Blockly

**Step 1 - Inserting the new formatted column**

Let’s create a new column that will hold the True/False information.
From the Variables menu, drag a DO block for the **data** variable. Then select the method insert.  


**Step 2 - Inform the position of the new column**

We can add the new column wherever we want, so let’s add it to the first position, column 0.

From the Math menu, get a number block and change it to **“0” (zero)**. Connect it as the first parameter of the insert operation.

**Step 3 - Define the name of the new column**

We can choose whatever name we want for the new column, so let’s call it **Is_Rural**.

From the Text menu, get a Quote block “” and type,  **Is_Rural**. Connect it as the second parameter of the insert operation.

**Step 4 - Select the original column**

Let’s go back to the column that has the data (Region_Type) we want to look at further.

From the List menu, drag a dictVariable for the **data** variable. Get a Quote block from the Text menu and type **Region_Type**.

**Step 5 - Compare this column with the value “Rural”**

What are we looking for? In this case, we want anything that has “Rural” as the data.

From the Logic menu, drag an equals block (=). Connect a Quote block with “Rural” to its right side and the data[“Region_Type”] block to its left side. Connect this whole block as the third parameter of the insert function.


**Step 6 - Print the data dataframe**

Let’s print the dataframe again and see the newly created column.

From the Variable menu, drag a DO block for the **data** variable.

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GnK0R1TWEAA5D8G?format=png&name=small)
</details>


In [2]:
# blockly code


#### Freehand


**Step 1 - Inserting the new formatted column**

Let’s create a new column that will hold the True/False information.

`data.insert() `


**Step 2 - Inform the position of the new column**

We can add the new column wherever we want, so let’s add it to the first position, **column 0**.

`data.insert(0)`


**Step 3 - Define the name of the new column**

We can choose whatever name we want for the new column, so let’s call it **Is_Rural**.

`data.insert(0,'Is_Rural' )`

**Step 4 - Select the original column**

Let’s go back to the column that has the data (Region_Type) we want to look at further.

Let’s select the original column, Region_Type.

`data.insert(0,'Is_Rural',(data['Region_Type'] ))`

**Step 5 - Compare this column with the value “Rural”**

What are we looking for? In this case, we want anything that has **“Rural”** as the data.

`data.insert(0,'Is_Rural',(data['Region_Type'] == 'Rural'))`

**Step 6 - Print the data dataframe**

Let’s print the dataframe again and see the newly created column.

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GnLAF2FXkAAPIVQ?format=png&name=small)
</details>

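**Your Turn**: Now it’s your turn! Add a new column called Is_Rural at the start of the dataframe that checks whether Region_Type is equal to 'Rural', and then print the dataframe to see your changes!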

In [None]:
# freehand code




**Explanation**: *You have created a new binary column called Is_Rural. If a row's Region_Type is 'Rural', the column is assigned a value of True; otherwise, it's False. In calculations, True counts as 1 and False counts as 0*.
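Here is a minimal sketch of the comparison on made-up data. The region values, including 'Suburban', are hypothetical, and the extra `Is_Rural_01` column is our own addition to show how True/False can be turned into 1/0.

```python
import pandas as pd

# Hypothetical region types
sample = pd.DataFrame({"Region_Type": ["Rural", "Urban", "Rural", "Suburban"]})

# The comparison produces one True/False value per row
sample.insert(0, "Is_Rural", sample["Region_Type"] == "Rural")

# Optional extra step (not in the lesson): store 1/0 instead of True/False
sample["Is_Rural_01"] = sample["Is_Rural"].astype(int)

print(sample)

# Because True counts as 1, summing the column counts the rural rows
print(sample["Is_Rural"].sum())
```

Note that the comparison is case sensitive, so a value typed as 'rural' would be coded as False.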

### Goal 14: Working with lots of categorical variables.

Before, we looked at a column (Region_Type) and recorded whether each value was ‘Rural’: false (0) or true (1). But what happens if we have more than 2 categories? When we have many categories, we switch it up so that each category has its own separate column, which we call dummy columns. This will make it easier to work with.


#### Blockly

**Step 1 - Getting get_dummies from pandas**

What’s great is that Pandas has a get_dummies function that makes this transformation easy. Let’s bring that in.
From the Variable menu, get a Do block for the **pd** variable. With that, select the **get_dummies** method.

**Step 2 - Telling Pandas what Data Source to Use**

Now that we’ve told pandas we want to create some dummy variables, let’s tell it what data source we want to work with. In this case, we will be using **data**.

As the first parameter, connect the variable **data**.

**Step 3 - Telling Pandas what Columns to Find the Categories**

So which columns have all the categories we want to break out?

As the second parameter, drag a Quote block and type **Energy_Source**.


**Step 4 - Add the new columns to the original dataframe**

So far we’ve separated out the categories into new columns. Now let’s save them back into the dataframe (**data**) so it’s all in one place.

From the Variable menu, drag a Set block for the data variable, and connect it to the **get_dummies** function block.


**Step 5 - Print the variable**

Let’s see it now by ‘printing’ and showing our work. From the Variable menu, drag the data variable.

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/Grf97gHWkAEtsRX?format=jpg&name=small)
</details>



In [None]:
# blockly code


#### Freehand

**Step 1 - Getting get_dummies from pandas**

What’s great is that Pandas has a **get_dummies** function that makes this transformation easy. Let’s bring that in.

From the pandas library, **pd**, call the function **get_dummies**() and save the result in **data**.

`data = pd.get_dummies( )`


**Step 2 - Telling Pandas what Data Source to Use**

Now that we’ve told pandas we want to create some dummy variables, let’s tell it what data source we want to work with. In this case, we will be using **data**, and the categories are in the **Energy_Source** column.

The get_dummies function returns a new dataframe in which the Energy_Source column is replaced by the dummy columns. Let’s save this updated dataframe into the original variable, **data**.

`data = pd.get_dummies(data,columns= ['Energy_Source'])`

**Step 3 - Print the updated dataframe**

Let’s print the updated dataframe with the dummy columns.

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/Grf943oXkAAcXAU?format=jpg&name=small)
</details>

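**Your Turn**: Now it’s your turn! Use pd.get_dummies on data with the Energy_Source column, save the result back to data, and then print data to see the new dummy columns!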

In [None]:
# freehand code



**Explanation**: *pd.get_dummies() is a powerful function in Pandas used to perform one-hot encoding on categorical variables. It converts categorical data into numerical format, suitable for machine learning algorithms*.
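The sketch below shows what one-hot encoding looks like on a tiny made-up dataframe. The energy sources listed here are hypothetical examples, not the categories in the course dataset.

```python
import pandas as pd

# Hypothetical sample data with three energy sources
sample = pd.DataFrame({
    "Facility_Age": [12, 30, 7],
    "Energy_Source": ["Solar", "Coal", "Wind"],
})

# Replace Energy_Source with one dummy column per category
sample = pd.get_dummies(sample, columns=["Energy_Source"])

print(sample.columns.tolist())
print(sample)
```

Each new column is named after the original column plus the category, such as Energy_Source_Solar, and the original Energy_Source column is no longer in the dataframe.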


### Goal 15: Removing the unnecessary columns

With the dataframe fully cleaned and the new columns created, let’s remove all the unnecessary columns and prepare to use this clean version of the dataset to train a model.

#### Blockly




**Step 1 - Drop the unnecessary columns**

Now that we’ve done the conversion of our columns, we can remove the original ones so we don’t get confused. This way we only keep what we are going to work with.

From the Variables menu, drag a Do block for the **data** variable. With that, select the **drop** method.  

**Step 2 - Tell what columns we want to drop**

Now that we have the drop method ready to go, let’s tell it what we want to get rid of. In this case, we’ll get rid of the original columns that we no longer need.

From the List menu, get a Create List With block, with 6 entries. For each one of those, get a Quote block (from the Text menu), and type the following column names: 'Region_Type', 'Facility_Age', 'Annual_Production', 'Report_Date', 'Certification_Level', 'New_Report_Date'. Connect this list as the first parameter of the drop operation.


**Step 3 - Run the command to drop the column**

The **axis=1** argument tells the drop method that we are removing columns, not rows, so let’s add that to our code.

Get a Freehand block and type, **axis=1**. Connect it as the second parameter of the drop operation.

**Step 4 -  Overwriting our dataframe so it only has the variables/columns that we need**

In the original **data** dataframe, we had a bunch of extra stuff that we cleaned up. Let’s resave data with only the columns that we want to work with.

From the Variables menu, bring in the **data** variable. Get the Set block of it, and connect to the drop block.


**Step 5 - Print the updated dataset**

Drag the **data** variable to the canvas so it will print it on the screen.

**Step 6 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GnLEMitWEAAC6_g?format=png&name=small)
</details>

In [None]:
# blockly code


#### Freehand


**Step 1 - Drop the unnecessary columns**

Now that we’ve done the conversion of our columns, we can remove the original ones so we don’t get confused. This way we only keep what we are going to work with.

`data.drop( )`

**Step 2 - Tell what columns we want to drop**

Now that we have the drop method ready to go, let’s tell it what we want to get rid of. In this case, we’ll get rid of the original columns that we no longer need.

`data.drop(['Region_Type', 'Facility_Age', 'Annual_Production', 'Report_Date', 'Certification_Level', 'New_Report_Date'])`

**Step 3 - Run the command to drop the column**

The **axis=1** argument tells the drop method that we are removing columns, not rows, so let’s add that to our code.

`data.drop(['Region_Type', 'Facility_Age', 'Annual_Production', 'Report_Date', 'Certification_Level', 'New_Report_Date'],axis=1)`

**Step 4 - Overwriting our dataframe so it only has the variables/columns that we need**  

In the original **data** dataframe, we had a bunch of extra columns that we no longer need. The drop method returns a new dataframe, so let’s save the result back to the original variable, **data**, to keep only the columns that we want to work with.

`data = data.drop(['Region_Type', 'Facility_Age', 'Annual_Production', 'Report_Date', 'Certification_Level', 'New_Report_Date'],axis=1)`


**Step 5 - Print the updated dataset**

Let’s print our final dataframe

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GnLFHx4WEAA2lME?format=jpg&name=small)
</details>

**Your Turn**: Give it a try—cleaning up your dataset is like clearing off your desk so you can focus and do your best work with just the important stuff!

In [None]:
# freehand code




**Explanation**: *You have removed the columns named 'Region_Type', 'Facility_Age', 'Annual_Production', 'Report_Date', 'Certification_Level', and 'New_Report_Date' from the DataFrame called data. The axis=1 argument specifies that we are dropping columns, not rows.  The result is then assigned back to the data variable, effectively updating the DataFrame to exclude those columns.  The line data then displays the modified DataFrame*.
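Here is a small sketch of dropping columns. The dataframe `sample` is made up and only contains a few of the lesson's column names, so just two columns are dropped.

```python
import pandas as pd

# Hypothetical sample data: two original columns and their cleaned versions
sample = pd.DataFrame({
    "Region_Type": ["Rural", "Urban"],
    "Is_Rural": [True, False],
    "Facility_Age": [12, 30],
    "New_Facility_Age": [12, 30],
})

# axis=1 means "drop columns"; without it, drop would look for row labels
sample = sample.drop(["Region_Type", "Facility_Age"], axis=1)

print(sample.columns.tolist())
```

pandas also accepts `sample.drop(columns=[...])`, which does the same thing and some people find easier to read.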

### Goal 16: Define the label column.

Now that we’ve gotten rid of the extra columns we don’t need, let’s put the thing we are trying to predict in a separate variable. That will make it easier for us to work with in the future.



#### Blockly


**Step 1 - Tell it what dataset and variable to work with**

So what are we going to predict? We are going to use a bunch of other variables to predict Emissions_Offset from the data dataset.

From the List menu, get a dictVariable for the **data** variable. Get also a Quote block and type, **Emissions_Offset**. Finally, connect them to each other.

**Step 2 -  Saving our predicted variable in its own new variable**

In math, the variable we are trying to predict is often called Y. Let’s save the Emissions_Offset column, which holds the values we want to predict, in a new variable called **Y**.

From the Variables menu, create a new variable, Y. Get the Set block of it, and connect to the data[“Emissions_Offset”] block.

**Step 3 -  Print the variable**

Now let’s see what all the stuff we put in Y looks like.
From the Variables menu, drag the block Y.

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GnOUeKeWoAAjESL?format=png&name=small)
</details>

In [None]:
# blockly code


#### Freehand

**Step 1 - Tell it what dataset and variable to work with**

So what are we going to predict? We are going to use a bunch of other variables to predict Emissions_Offset from the data dataset.

`data['Emissions_Offset']`


**Step 2 -  Saving our predicted variable in its own new variable**

In math, the variable we are trying to predict is often called Y. Let’s save the Emissions_Offset column, which holds the values we want to predict, in a new variable called **Y**.

`Y = data['Emissions_Offset']`




**Step 3 -  Print the variable**

Now let’s see what all the stuff we put in Y looks like.

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GnOVGyfW4AAI_i4?format=png&name=small)
</details>



**Your Turn**: We’re going to predict Emissions_Offset using other info from our data, so let’s save it in a new variable called Y and see what it looks like on the screen!


In [None]:
# freehand code



**Explanation**: *You have created a new variable Y that holds the data from the 'Emissions_Offset' column of the DataFrame data*.

### Goal 17: Define the Feature Columns

Now let’s save the other columns, the features, in their own variable to prepare for the model training.

#### Blockly


**Step 1 - Drop the unnecessary columns**

To build the feature set, we start from **data** and drop the column we are trying to predict. This way we only keep the columns we will use as features.

From the Variables menu, drag a Do block for the **data** variable. With that, select the **drop** method.  

**Step 2 - Tell what columns we want to drop**

Now that we have the drop method ready to go, let’s tell it what we want to get rid of. In this case, we’ll get rid of Emissions_Offset.

From the Text menu, get a Quote block and type **Emissions_Offset**. Connect this block as the first parameter of the drop operation.


**Step 3 - Run the command to drop the column**

The **axis=1** argument tells the drop method that we are removing a column, not a row, so let’s add that to our code.

Get a Freehand block and type, **axis=1**. Connect it as the second parameter of the drop operation.

**Step 4 -  Saving our feature columns in its own new variable**

The remaining columns are the ones we will use to predict Emissions_Offset. Let’s put them all into a single variable so it’s easier to work with. We’ll call that **X**.

From the Variables menu, create a new variable, **X**. Get the Set block of it, and connect to the drop block.

**Step 5 - Print the updated dataset**

Drag the **X** variable to the canvas so it will print it on the screen.




**Step 6 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GnOVlRVWoAAxIQr?format=png&name=small)
</details>






In [None]:
# blockly code


#### Freehand


**Step 1 - Drop the unnecessary columns**

To build the feature set, we start from **data** and drop the column we are trying to predict. This way we only keep the columns we will use as features.

As the first parameter, type the column to be dropped, **'Emissions_Offset'**, in quotes. As the second parameter, type **axis=1**, which tells drop to remove a column rather than a row.

`data.drop('Emissions_Offset',axis=1)`

**Step 2 -  Saving our feature columns in their own new variable**

The remaining columns are the ones we will use to predict Emissions_Offset. Let’s put them all into a single variable called **X** so it’s easier to work with.

`X = data.drop('Emissions_Offset',axis=1)`

**Step 3 -  Print the updated dataset**

Print the features saved in the X variable.

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GnOWedKXIAAs7kw?format=png&name=small)
</details>


**Your Turn**: Let’s clean up our data by dropping the extra stuff we don’t need, so it’s easier to focus on what really matters and watch your code work like magic!

In [None]:
# freehand code




**Explanation**: *You have created a new variable X which is a copy of the DataFrame data but with the 'Emissions_Offset' column removed*.
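Putting the last two goals together, the sketch below splits a small made-up dataframe into a label (Y) and features (X). The values are invented for illustration; the column names follow the code in the steps above.

```python
import pandas as pd

# Hypothetical cleaned dataset: two feature columns and the column to predict
sample = pd.DataFrame({
    "New_Facility_Age": [12, 30, 7],
    "Is_Rural": [True, False, True],
    "Emissions_Offset": [1.5, 0.4, 2.1],
})

# Label: the single column we want to predict
Y = sample["Emissions_Offset"]

# Features: everything except the label
X = sample.drop("Emissions_Offset", axis=1)

print(X.shape, Y.shape)
```

X and Y have the same number of rows, so each row of features lines up with exactly one label. The original dataframe is not changed, because drop returns a new dataframe.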


### Goal 18: Import the linear model library.

Let’s bring in a library/package to help with the linear regression that will help us with our analysis and other data science tasks.



#### Blockly


**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the “command” is “import”, which brings the add-on package in.

Bring in the IMPORT menu, which can be helpful to bring in other data tools. In this case, we're bringing in the **import** block.

**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **sklearn.linear_model**, which will bring in tools for building linear models such as linear regression.


**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, give the imported package a short nickname (an alias). This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve used **lm**, and we type it in the open area.






**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GbFDOz7XoAAVLtr?format=png&name=small)
</details>



In [None]:
# blockly code


#### Freehand


**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the “command” is **“import”**, which brings the add-on package in.

**Step 2 - Telling what library to import**

After the import command, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **sklearn.linear_model**, which will bring in tools for building linear models such as linear regression.

**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, give the imported package a short nickname (an alias) by typing **as** followed by the name. This handy feature helps cut down on all the typing later on. Feel free to use whatever name you want that will help you remember it later on. In the example below, we’ve used **lm**.

**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GbFDMCWW8AABSiy?format=png&name=small)
</details>

**Your Turn**: Now it’s your turn! We’re going to dive into the sklearn.linear_model package, which helps us with some really cool data science things. First, let’s import the package and assign it to the variable “lm” to make it easier to use throughout our notebook.

In [None]:
# freehand code


**Explanation**: *You have imported the linear model module of the scikit-learn library, which includes methods for creating linear regression models*.
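Written out as code, the import described in the steps above can be sketched as follows. This assumes scikit-learn is installed in your notebook environment; the alias `lm` is the one suggested in the steps.

```python
# Import the linear model module and give it the short alias lm
import sklearn.linear_model as lm

# The alias now gives access to everything in the module,
# including the LinearRegression class
print(lm.LinearRegression)
```

From here on, typing `lm.` is shorthand for typing `sklearn.linear_model.`.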


### Goal 19: Setting up our linear model.

Let’s create the model that we will train on our dataset.



#### Blockly


**Step 1 - Create and assign a variable**

Now that we’re all set with our new package, let’s create a variable called **regr** to hold our model. Think of it as a labeled box where we keep the model so we can train it and use it later.

In Blockly, bring in the VARIABLES menu. On the "Variables" menu, click Create Variable, type a name for our model, **regr**. Then, drag a "SET" block to the workspace for the created variable. This block allows us to create a new variable and assign a value to it.

**Step 2 - Create the linear regression model**

Using the linear model library, we call **LinearRegression**() to create the linear regression model.

From the Variables menu, drag a Create block for the **lm** variable. In the Create block’s list box, select the option LinearRegression. This specifies the type (class) of object we want to create, which is the LinearRegression model from the linear model module.

With that, a new LinearRegression object is created. Linear regression is a type of model that uses one or more numbers (the features) to predict another number.

**Step 3 - Store the linear regressor model in a variable**

We can now connect the **regr** variable with the **LinearRegression** model.


**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!


<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GbFD3KsXwAAqIWJ?format=png&name=small)
</details>

In [None]:
# blockly code


#### Freehand


**Step 1 - Create the linear regression model**

Using the linear model library, we call **LinearRegression**() to create the model.

`lm.LinearRegression()`

**Step 2 - Store the regression model in a variable**

Now that we can create the model, let’s store it in a variable called **regr**.

Just like we did before, let’s type out a variable name. Rather than typing out `lm.LinearRegression()` every time, this easy-to-remember name will hold the model we create.

`regr = lm.LinearRegression()`

**Step 3 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!


<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GbFD57vWsAAOizJ?format=png&name=small)
</details>


**Your Turn**: Now give it a go and see what you get!


In [None]:
# freehand code







**Explanation**: *You have created a linear regression model. By creating this model, we’re preparing to analyze how one variable can be predicted from other variables*.
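
The short sketch below, assuming scikit-learn is installed, shows what the new model looks like right after it is created. The checks with `type` and `hasattr` are only for illustration and are not part of the lesson’s required code.

```python
import sklearn.linear_model as lm

# Create an empty (untrained) linear regression model and store it in regr
regr = lm.LinearRegression()

# Show what kind of object regr holds
print(type(regr).__name__)

# The model has not seen any data yet, so it has no learned coefficients
print(hasattr(regr, "coef_"))
```

The second line printed should be `False`, because the model only learns its coefficients once it is trained with fit().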

### Goal 20: Train and score the regressor model.

Now that we’ve brought in our linear regression model, let’s train the model to see how it will learn from the data points that we have in the file.


#### Blockly



**Step 1 - Prepare to train the model**

To train the regressor model, we call the fit() method on it. The ‘fit’ method trains the model on the data we give it.

From the Variable menu, drag the DO block for the **regr** variable, and select the fit function as the do operation. This specifies the function we want to call, which is the fit method of the linear regression.


**Step 2 - Have the training features ready**

The next step for training the model is to select the features to train the regressor. In this step, we select the features and add them as a dataframe in the parameter. In this case, the model will learn from the features stored in the variable **X** and use them to predict our other variable.

From the Lists menu, drag a dictVariable and select the **X** variable from the list of available variables.

**Step 3 -  Have the training label ready**

So what are we trying to predict? Next, we need to add the data labels for the selected features. We add the variable **Y**, which holds the data labels (**Emission_Offset**), as the second parameter in the fit() method.



**Step 4 - Measure the correctness on the training dataset**

To measure the correctness of the model, we will use the score() method of the linear regression model. Just as in the previous step, we call a method on **regr**, but we replace the fit() method with the score() method. Based on the ‘fit’, we will see how well the model predicts the values in our training dataset.

This will give us the linear regression correctness score, called R-squared. The closer the score is to 1, the better the predictions match the actual values. As a rough guide, a medium score might be more like .95 and a not-so-great score might be .90, but it depends on the topic you are looking at.

Right-click on the "**regr.fit**" block and select "Duplicate" from the context menu. This creates a copy of the block. Within the duplicated block, click on the method dropdown menu and select "score" from the list of available methods. The score method will work similarly to fit, and will use the training features and label to measure how much of the training data was learned.  

**Step 5 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GrZ7hw9XkAEC1E2?format=png&name=small)
</details>


In [3]:
# blockly code


#### Freehand


**Step 1 - Prepare to train the model**

To train the regressor model, **regr**, we call the fit() method on it. The **‘fit’** method trains the model on the data we give it.

`regr.fit()`

**Step 2 - Have the training features ready**

The next step for training the model is to select the features to train the regressor. In this step, we select the features and add them as a dataframe in the parameter. In this case, the model will train (learn) with the scaled features we have stored in the variable **X**.

`regr.fit(X)`


**Step 3 - Have the training label ready**

So what is the label that we are trying to predict? Next, we need to add the data labels for the selected features. We add the data labels, **Y (Emission_Offset)**, as the second parameter in the fit() method.

`regr.fit(X,Y)`

**Step 4 - Measure the correctness on the training dataset**

To measure the correctness of the model, we will use the **score**() method of the linear regression model. Just as in the previous step, we call a method on **regr**, but we replace the fit() method with the **score**() method. Based on the ‘fit’, we will see how well the model predicts the values in our training dataset. This will give us the linear regression correctness score, called R-squared. The closer the score is to 1, the better the predictions match the actual values. As a rough guide, a medium score might be more like .95 and a not-so-great score might be .90, but it depends on the topic you are looking at.

`regr.score(X,Y)`

**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GrZ7r-XWcAADCUg?format=png&name=small)
</details>
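
To see the fit and score steps working together, here is a minimal sketch. It assumes scikit-learn and NumPy are installed, and it uses made-up numbers in place of the lesson’s dataset, so the values of `X` and `Y` below are hypothetical. A pandas dataframe of features would be passed to fit() in the same way.

```python
import numpy as np
import sklearn.linear_model as lm

# Made-up numbers for illustration only; they are NOT from the lesson's dataset
X = np.array([[1], [2], [3], [4], [5], [6]])       # one feature, six rows
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])     # the label we want to predict

# Create the model, then train it on the features and labels
regr = lm.LinearRegression()
regr.fit(X, Y)

# Score the model on the same training data (this is the R-squared value)
train_score = regr.score(X, Y)
print(round(train_score, 3))
```

Because these made-up points fall almost on a straight line, the score comes out very close to 1. Real data is usually messier, so expect lower scores.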

**Your Turn**: Now it’s time to train and score your regressor model! Use the same steps as above: call the fit method with your training features and labels, then use the score method to measure how well your model has learned from the training data. Finally, run the code to see your results!

In [None]:
# freehand code




**Explanation**: *You have trained the model using data from the training dataset. It takes the features as input and Emission Offset as the output we want to predict. The fit function makes the model learn the relationship between these variables. You have also calculated the model’s R-squared score on the training data, which tells us how well the model’s predictions match the actual Emission Offset values in the training dataset. An R-squared score closer to 1 means the model’s predictions are closer to the actual values, while a score closer to 0 means the predictions are further off*.
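
To make the R-squared score less mysterious, the sketch below calculates it by hand and compares it with what score() returns. It assumes scikit-learn and NumPy are installed and uses made-up numbers, so `X`, `Y`, and the helper names `ss_res` and `ss_tot` are hypothetical and not part of the lesson’s dataset.

```python
import numpy as np
import sklearn.linear_model as lm

# Made-up numbers for illustration only; they are NOT from the lesson's dataset
X = np.array([[1], [2], [3], [4], [5], [6]])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

# Train the model and ask it to predict the training labels
regr = lm.LinearRegression().fit(X, Y)
predictions = regr.predict(X)

# R-squared = 1 - (squared prediction errors) / (squared distances from the mean)
ss_res = np.sum((Y - predictions) ** 2)
ss_tot = np.sum((Y - Y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Check that the hand calculation agrees with the score() method
print(np.isclose(r_squared, regr.score(X, Y)))
```

When the prediction errors are small compared with how spread out the labels are, the fraction is close to 0 and the score is close to 1.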

## WHAT DID YOU LEARN?

Real-world datasets aren’t perfect, and we need to analyze the data thoughtfully in order to make correct predictions and draw correct conclusions. We can clean the data in a variety of ways, but we must understand that how we clean the data will affect our model and, therefore, our results. As we work, it’s important to use versioning to keep track of the many versions of our datasets.

## WHAT’S NEXT?

ToDO

## ANY EXTRAS?


ToDO