## Using categorical data in regression models

In the last workshop we revisited the penguins dataset and used a linear regression model to try to predict penguin body mass from flipper length. We achieved an $R^{2}$ score of approximately 75% with this model, with an RMSE of just under 400g. 

The penguins dataset contains further information that we might be able to use to increase the performance of our model. 

#### Exercise 1

Use the other numeric data contained in the penguins dataset to build a model that uses **multiple inputs** to predict penguin body mass. 

You will need to:
- load the penguins dataset from `seaborn`
- for simplicity, drop any rows containing NA values
- use the `info` method to determine which columns have a numerical data type (these will be used as your _features_ to predict body mass)
- instantiate a linear model from `scikit-learn`
- define your feature (input) and target variables
- split your inputs into test and train sets (remember to set `random_state` so that the sets are reproducible)
- train your model
- make predictions with your model
- print the $R^2$ score and RMSE
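
As a rough guide to the shape of this workflow, here is a minimal sketch on synthetic stand-in data rather than the penguins dataset. The column names `feature_a`, `feature_b` and `target`, the coefficients used to generate the data, and the choices `test_size=0.2` and `random_state=42` are all assumptions made for illustration; your own solution should use the numeric penguin columns instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the penguins dataset): two numeric features
# and a target that depends linearly on both, plus random noise.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "feature_a": rng.normal(200, 15, n),
    "feature_b": rng.normal(45, 5, n),
})
toy["target"] = 50 * toy["feature_a"] + 20 * toy["feature_b"] + rng.normal(0, 300, n)

# Define the input features and the target variable.
X = toy[["feature_a", "feature_b"]]
y = toy["target"]

# Split into train and test sets; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Instantiate and train the model, then predict on the held-out test set.
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluate: R^2 and root mean squared error.
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"R^2: {r2:.3f}, RMSE: {rmse:.1f}")
```

The only change needed to move from one input to multiple inputs is the list of columns selected for `X`; the rest of the workflow is the same as in the previous workshop.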


In [2]:
import seaborn as sns


In [4]:
#### ADD CODE HERE























Unfortunately, according to our $R^{2}$ score, this additional data doesn't appear to add very much to our model performance (compared to our 75% using only one input feature). We do have other information in the penguins dataset, but it is _categorical_ data. In the following exercises we will look at how we can exploit this type of data in our regression models.

#### Exercise 2

Make a plot to show the relationship between bill length and body mass that colour-codes the data points based on penguin species.

In [6]:
#### ADD CODE HERE




_Do you think that knowledge of the species of penguin would be helpful when trying to predict body mass?_

Looking at the plot above, the species categories would clearly add new and valuable information to our model. To make this possible, we need to transform our categorical data to make it appropriate for use in our regression model.

To allow the use of the species category in our linear regression model we need to convert the categorical variables into numerical values. We will do this using a technique called **one-hot encoding**. To apply one-hot encoding to the species data in our penguins dataset, follow the instructions for _Exercise 3_ below. 

#### Exercise 3

**(a)** Add three new columns to the penguins dataset:
- The `Gentoo` column should be 1 when the species is Gentoo, and zero in all other cases.
- The `Adelie` column should be 1 when the species is Adelie, and zero in all other cases.
- The `Chinstrap` column should be 1 when the species is Chinstrap, and zero in all other cases.

Having done this, you will have created three new columns in your DataFrame, with each column containing a value of 1 for cases where the species variable matches the species specified in the column title, and a value of 0 for all other cases. Essentially we have created a true / false flag in these columns, following:
- 1 if `species` = `column_name`
- 0 if not
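
The flag rule above can be sketched on a tiny hand-made DataFrame. The five-row `toy` frame below is invented purely for illustration; the idea is that comparing the `species` column with a name gives a boolean Series, which is then cast to integers.

```python
import pandas as pd

# A tiny hand-made frame (illustrative only, not the real penguins data).
toy = pd.DataFrame(
    {"species": ["Adelie", "Gentoo", "Chinstrap", "Gentoo", "Adelie"]}
)

# For each species, compare the column against the name (True/False per row)
# and cast the result to integers (1/0).
for name in ["Gentoo", "Adelie", "Chinstrap"]:
    toy[name] = (toy["species"] == name).astype(int)

print(toy)
```

Each row ends up with exactly one 1 across the three new columns. pandas also provides `pd.get_dummies`, which builds indicator columns for every category in one call, but writing the comparison by hand makes the encoding explicit.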



In [7]:
#### ADD CODE HERE











**(b)** By transforming our variables in this way we can extract information from our categorical data in a way that can be interpreted by our regression model. 

To convince yourself of the value of using the species information in this format:
- plot the relationship between the newly created `Gentoo` column and `body_mass_g` 
- print the linear correlation coefficient between `Gentoo` and `body_mass_g`.
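
As a hint for the correlation step, the sketch below computes a Pearson correlation coefficient between a 0/1 indicator column and a numeric column using the pandas `Series.corr` method. The column names `is_group_a` and `mass_g` and the six values are made up for illustration.

```python
import pandas as pd

# Made-up example: three members of "group a" with high mass values and
# three non-members with lower mass values.
toy = pd.DataFrame({
    "is_group_a": [1, 1, 1, 0, 0, 0],
    "mass_g": [5000, 5200, 4900, 3700, 3600, 3900],
})

# Series.corr computes the Pearson (linear) correlation by default.
r = toy["is_group_a"].corr(toy["mass_g"])
print(f"Pearson r: {r:.3f}")
```

A coefficient close to +1 or -1 means that membership of the group separates high and low values of the numeric column well; a coefficient close to 0 means that the indicator carries little linear information about it.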

In [8]:
#### ADD CODE HERE



In [9]:
#### ADD CODE HERE



**(c)** Do the same for the newly created `Adelie` and `Chinstrap` columns. Then ask yourself the following questions:

- Of the three species, which has the potential to give the most additional sensitivity to our model? 
- Why?
- Does this follow from what you might expect upon further study of the relationship plot from _Exercise 2_?

In [10]:
#### ADD CODE HERE



In [11]:
#### ADD CODE HERE



The `Gentoo` column has the strongest linear correlation with penguin body mass, which suggests that, of the three species indicators, it may provide the most additional information for predicting body mass. 

Looking at the two dimensional relationship plot from _Exercise 2_, we can see that Gentoo penguins almost exclusively occupy higher body mass values. This means that knowing whether or not a penguin belongs to the Gentoo species provides valuable information about body mass, in that we would expect higher body mass values for Gentoo penguins compared with the other two species. 

#### Exercise 4

Build the regression model again as before, but this time add these new columns as features to your training data. Complete all the steps that you followed during _Exercise 1_. **Remember to use the same `random_state` and `test_size`** to ensure a fair comparison with your output from _Exercise 1_.
- Do you see any improvement in the model performance?
- Does one of your new categories in particular make the most difference, or is it a joint effort?

In [12]:
#### ADD CODE HERE























If you try to incrementally add the penguin species to your linear regression model and rerun the prediction you will find that the Gentoo variable does indeed give the biggest boost in model performance, as expected based on _Exercise 3_.
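
One way to organise this kind of incremental comparison is sketched below on synthetic data, not the penguins dataset. The columns `length`, `group`, `a`, `b`, `c` and `target` are hypothetical, and the data are deliberately generated so that group `a` carries a large offset in the target. The same `random_state` and `test_size` are reused for every feature set so that the scores are compared on the same split.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (illustrative only): one numeric feature and a
# three-level categorical variable.
rng = np.random.default_rng(1)
n = 300
toy = pd.DataFrame({
    "length": rng.normal(45, 5, n),
    "group": rng.choice(["a", "b", "c"], size=n),
})

# One-hot encode the categorical variable by hand.
for name in ["a", "b", "c"]:
    toy[name] = (toy["group"] == name).astype(int)

# By construction, group "a" shifts the target upwards by a large amount.
toy["target"] = 30 * toy["length"] + 1200 * toy["a"] + rng.normal(0, 250, n)

# Feature sets to compare, from the baseline up to all indicators.
feature_sets = {
    "length only": ["length"],
    "length + a": ["length", "a"],
    "length + b": ["length", "b"],
    "length + c": ["length", "c"],
    "length + a + b + c": ["length", "a", "b", "c"],
}

scores = {}
for label, cols in feature_sets.items():
    # Same test_size and random_state each time, so the split is identical.
    X_train, X_test, y_train, y_test = train_test_split(
        toy[cols], toy["target"], test_size=0.2, random_state=42
    )
    model = LinearRegression().fit(X_train, y_train)
    scores[label] = r2_score(y_test, model.predict(X_test))
    print(f"{label}: R^2 = {scores[label]:.3f}")
```

Because the indicator for group `a` is built into the synthetic target, it is the one that lifts the score most here; with the penguins data, the equivalent comparison shows which species column contributes the most.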

Transforming categorical data in this way is a useful trick to remember for data analysis tasks. It is often helpful to be able to incorporate categorical information into numerical models. 