# Creating Features
>  In this chapter, you will explore what feature engineering is and how to get started with applying it to real-world data. You will load, explore and visualize a survey response dataset, and in doing so you will learn about its underlying data types and why they have an influence on how you should engineer your features. Using the pandas package you will create new features from both categorical and continuous columns.

- toc: true 
- badges: true
- comments: true
- author: Lucas Nunes
- categories: [Python, Datacamp, Machine Learning]
- image: images/datacamp/1_supervised_learning_with_scikit_learn/2_regression.png

> Note: This is a summary of the chapter 1 exercises from the course "Feature Engineering for Machine Learning in Python" on DataCamp. <br>[Github repo](https://github.com/lnunesAI/Datacamp/) / [Course link](https://www.datacamp.com/tracks/machine-learning-scientist-with-python)

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Why generate features?

<div class=""><p>Pandas is one of the most popular packages used to work with tabular data in Python. It is generally imported using the alias <code>pd</code> and can be used to load a CSV (or other delimited file) using <code>read_csv()</code>.</p>
<p>You will be working with a modified subset of the <a href="https://insights.stackoverflow.com/survey/2018/#overview" target="_blank" rel="noopener noreferrer">Stack Overflow survey response data</a> in the first three chapters of this course. This data set records the details and preferences of thousands of users of the Stack Overflow website.</p></div>
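As a minimal sketch of how `read_csv()` behaves, the snippet below loads a tiny CSV from an in-memory string instead of a URL. The column values and the names `csv_text` and `toy_df` are made up for illustration and are not part of the survey data.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a file path or URL
csv_text = "Country,Age,ConvertedSalary\nFrance,31,52000.0\nIndia,27,\n"

# read_csv accepts a path, a URL, or any file-like object
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.shape)   # (rows, columns)
print(toy_df.dtypes)  # the empty salary field is read as NaN
```

Empty fields are read as `NaN`, which is why a salary column with missing entries ends up as a float column.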

Instructions 1/4
<ul>
<li>Import the <code>pandas</code> library as <code>pd</code>. </li>
<li><code>so_survey_csv</code> contains the URL to a CSV file. Import it using Pandas into <code>so_survey_df</code>.</li>
</ul>

In [7]:
# Import pandas
import pandas as pd

# Import so_survey_csv into so_survey_df
so_survey_df = pd.read_csv('https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/10-feature-engineering-for-machine-learning-in-python/datasets/so_survey_df.csv')

Instructions 2/4
<p>Print the first five rows of <code>so_survey_df</code>.</p>

In [8]:
# Print the first five rows of the DataFrame
print(so_survey_df.head(5))

      SurveyDate  ...   RawSalary
0  2/28/18 20:20  ...         NaN
1  6/28/18 13:26  ...   70,841.00
2    6/6/18 3:37  ...         NaN
3    5/9/18 1:06  ...   21,426.00
4  4/12/18 22:41  ...  £41,671.00

[5 rows x 11 columns]


Instructions 3/4
<p>Print the data type of each column in <code>so_survey_df</code>.</p>

In [9]:
# Print the data type of each column
print(so_survey_df.dtypes)

SurveyDate                     object
FormalEducation                object
ConvertedSalary               float64
Hobby                          object
Country                        object
StackOverflowJobsRecommend    float64
VersionControl                 object
Age                             int64
Years Experience                int64
Gender                         object
RawSalary                      object
dtype: object


Instructions 4/4
<p>What type of data is the <code>ConvertedSalary</code> column?</p>

<pre>
Possible Answers

Datetime

<b>Numeric</b>

String

Boolean
</pre>

**ConvertedSalary contains floats which are numeric.**

### Selecting specific data types


<div class=""><p>Often a data set will contain columns with several different data types (like the one you are working with). The majority of machine learning models require you to have a consistent data type across features. Similarly, most feature engineering techniques are applicable to only one type of data at a time. For these reasons among others, you will often want to be able to access just the columns of certain types when working with a DataFrame. </p>
<p>The DataFrame (<code>so_survey_df</code>) from the previous exercise is available in your workspace.</p></div>
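To illustrate selecting columns by type, here is a small sketch on a made-up DataFrame (`toy_df` and its values are assumptions, not the survey data). It uses the `'number'` selector, which covers both integer and float columns.

```python
import pandas as pd

# Hypothetical toy data with one text column and two numeric columns
toy_df = pd.DataFrame({
    "Country": ["France", "India", "Spain"],
    "Age": [31, 27, 45],
    "ConvertedSalary": [52000.0, None, 38000.0],
})

# Keep only numeric columns (integers and floats)
numeric_df = toy_df.select_dtypes(include="number")

# Keep everything that is not numeric
text_df = toy_df.select_dtypes(exclude="number")

print(list(numeric_df.columns))
print(list(text_df.columns))
```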

Instructions
<ul>
<li>Create a subset of <code>so_survey_df</code> consisting of only the numeric (<code>int</code> and <code>float</code>) columns.</li>
<li>Print the column names contained in <code>so_numeric_df</code>.</li>
</ul>

In [10]:
# Create subset of only the numeric columns
so_numeric_df = so_survey_df.select_dtypes(include=['int', 'float'])

# Print the column names contained in so_numeric_df
print(so_numeric_df.columns)

Index(['ConvertedSalary', 'StackOverflowJobsRecommend', 'Age',
       'Years Experience'],
      dtype='object')


**In the next lesson, you will learn the most common ways of dealing with categorical data.**

## Dealing with categorical features

<p>To use categorical variables in a machine learning model, you first need to represent them in a quantitative way. The two most common approaches are to one-hot encode the variables or to use dummy variables. In this exercise, you will create both types of encoding and compare the resulting column sets. We will continue using the same DataFrame from the previous lesson, loaded as <code>so_survey_df</code>, and focus on its <code>Country</code> column.</p>

Instructions 1/2
<p>One-hot encode the <code>Country</code> column, adding "OH" as a prefix for each column.</p>



In [12]:
# Convert the Country column to a one-hot encoded DataFrame
one_hot_encoded = pd.get_dummies(so_survey_df, columns=['Country'], prefix='OH')

# Print the column names
print(one_hot_encoded.columns)

Index(['SurveyDate', 'FormalEducation', 'ConvertedSalary', 'Hobby',
       'StackOverflowJobsRecommend', 'VersionControl', 'Age',
       'Years Experience', 'Gender', 'RawSalary', 'OH_France', 'OH_India',
       'OH_Ireland', 'OH_Russia', 'OH_South Africa', 'OH_Spain', 'OH_Sweeden',
       'OH_UK', 'OH_USA', 'OH_Ukraine'],
      dtype='object')


Instructions 2/2
<p>Create dummy variables for the <code>Country</code> column, adding "DM" as a prefix for each column.</p>

In [13]:
# Create dummy variables for the Country column
dummy = pd.get_dummies(so_survey_df, columns=['Country'], drop_first=True, prefix='DM')

# Print the column names
print(dummy.columns)

Index(['SurveyDate', 'FormalEducation', 'ConvertedSalary', 'Hobby',
       'StackOverflowJobsRecommend', 'VersionControl', 'Age',
       'Years Experience', 'Gender', 'RawSalary', 'DM_India', 'DM_Ireland',
       'DM_Russia', 'DM_South Africa', 'DM_Spain', 'DM_Sweeden', 'DM_UK',
       'DM_USA', 'DM_Ukraine'],
      dtype='object')


**Did you notice that the column for France was missing when you created dummy variables? Now you can choose to use one-hot encoding or dummy variables where appropriate.**
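The difference between the two encodings can be sketched on a toy column (the data and the names `toy_df`, `one_hot` and `dummies` are made up for illustration). With `drop_first=True`, the first category gets no column of its own and is represented by a row of zeros.

```python
import pandas as pd

# Hypothetical toy column with three categories
toy_df = pd.DataFrame({"Country": ["France", "India", "Spain", "India"]})

# One-hot encoding: one column per category
one_hot = pd.get_dummies(toy_df, columns=["Country"], prefix="OH")

# Dummy encoding: the first category (France) is dropped
dummies = pd.get_dummies(toy_df, columns=["Country"], prefix="DM", drop_first=True)

print(list(one_hot.columns))
print(list(dummies.columns))

# The France row is all zeros in the dummy encoding
print(dummies.iloc[0])
```

Dropping one category avoids carrying redundant information, since the omitted category is implied when all other columns are zero.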

### Dealing with uncommon categories

<p>Some features can have many different categories but a very uneven distribution of their occurrences. Take, for example, data scientists' favorite programming languages: common choices are Python, R, and Julia, but some individuals make less common choices such as FORTRAN or C. In these cases, you may not want to create a feature for each value, only for the more common ones.</p>

Instructions 1/3
<ul>
<li>Extract the <code>Country</code> column of <code>so_survey_df</code> as a series and assign it to <code>countries</code>. </li>
<li>Find the counts of each category in the newly created <code>countries</code> series.</li>
</ul>

In [15]:
# Create a series out of the Country column
countries = so_survey_df.Country

# Get the counts of each category
country_counts = countries.value_counts()

# Print the count values for each category
print(country_counts)

South Africa    166
USA             164
Spain           134
Sweeden         119
France          115
Russia           97
India            95
UK               95
Ukraine           9
Ireland           5
Name: Country, dtype: int64


Instructions 2/3
<ul>
<li>Create a mask for values occurring fewer than 10 times in <code>country_counts</code>.</li>
<li>Print the first 5 rows of the mask.</li>
</ul>

In [16]:
# Create a mask for only categories that occur less than 10 times
mask = countries.isin(country_counts[country_counts < 10].index)

# Print the top 5 rows in the mask series
print(mask.head())

0    False
1    False
2    False
3    False
4    False
Name: Country, dtype: bool


Instructions 3/3
<ul>
<li>Label values occurring less than the mask cutoff as 'Other'.</li>
<li>Print the new category counts in <code>countries</code>.</li>
</ul>

In [17]:
# Label all other categories as Other
countries[mask] = 'Other'

# Print the updated category counts
print(countries.value_counts())

South Africa    166
USA             164
Spain           134
Sweeden         119
France          115
Russia           97
India            95
UK               95
Other            14
Name: Country, dtype: int64


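> Note: pandas emits a `SettingWithCopyWarning` here ("A value is trying to be set on a copy of a slice from a DataFrame") because `countries` was taken directly from `so_survey_df`. Extracting the column with `.copy()` avoids it. See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy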


**Now you can work with large data sets while grouping low-frequency categories.**
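A compact variant of the same idea, sketched on made-up data, uses `Series.mask()` to return a new series instead of assigning into the original. The cutoff of 3 and the names `rare` and `grouped` are assumptions chosen for the toy example.

```python
import pandas as pd

# Hypothetical toy series: two frequent categories and two rare ones
countries = pd.Series(
    ["USA"] * 4 + ["Spain"] * 3 + ["Ukraine", "Ireland"], name="Country"
)

# Count each category and collect those below the cutoff
counts = countries.value_counts()
rare = counts[counts < 3].index

# mask() replaces values where the condition is True and returns a new series
grouped = countries.mask(countries.isin(rare), "Other")

print(grouped.value_counts())
```

Because `mask()` does not modify `countries` in place, the original values remain available for comparison.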

## Numeric variables

### Binarizing columns

<p>While numeric values can often be used without any feature engineering, there will be cases when some form of manipulation is useful. For example, you might not care about the magnitude of a value but only about its direction, or whether it exists at all. In these situations, you will want to binarize a column. In the <code>so_survey_df</code> data, a large number of survey respondents work voluntarily (without pay). You will create a new column titled <code>Paid_Job</code> indicating whether each person is paid (their salary is greater than zero).</p>

In [21]:
# Fill missing values with 0 in every column, so a missing salary counts as unpaid
so_survey_df = so_survey_df.fillna(0)

Instructions
<ul>
<li>Create a new column called <code>Paid_Job</code> filled with zeros.</li>
<li>Replace all the <code>Paid_Job</code> values with a 1 where the corresponding <code>ConvertedSalary</code> is greater than 0.</li>
</ul>

In [22]:
# Create the Paid_Job column filled with zeros
so_survey_df['Paid_Job'] = 0

# Replace all the Paid_Job values where ConvertedSalary is > 0
so_survey_df.loc[so_survey_df['ConvertedSalary'] > 0, 'Paid_Job'] = 1

# Print the first five rows of the columns
print(so_survey_df[['Paid_Job', 'ConvertedSalary']].head())

   Paid_Job  ConvertedSalary
0         0              0.0
1         1          70841.0
2         0              0.0
3         1          21426.0
4         1          41671.0


**Binarizing columns can also be useful for your target variables.**
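The same binarization can also be written as a single boolean comparison cast to integers. This is a sketch on made-up salaries (`toy_df` is hypothetical), not the course's reference solution.

```python
import pandas as pd

# Hypothetical toy salaries, including a missing value
toy_df = pd.DataFrame({"ConvertedSalary": [0.0, 70841.0, None, 21426.0]})

# The comparison yields True/False; astype(int) turns that into 1/0.
# A missing salary compares as False, so it is flagged as unpaid.
toy_df["Paid_Job"] = (toy_df["ConvertedSalary"] > 0).astype(int)

print(toy_df)
```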

### Binning values

<div class=""><p>For many continuous variables you will care less about the exact value than about the bucket it falls into. Binning can be useful when plotting values or when simplifying your machine learning models. It is mostly used on continuous variables where precision is not the biggest concern, e.g. age, height, wages.</p>
<p>Bins are created using <code>pd.cut(df['column_name'], bins)</code> where <code>bins</code> can be an integer specifying the number of evenly spaced bins, or a list of bin boundaries.</p></div>
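Both ways of calling `pd.cut()` can be sketched on a handful of made-up salaries (the values, boundaries and labels here are assumptions for illustration). By default each bin includes its right edge, so a value equal to a boundary falls into the lower bin.

```python
import numpy as np
import pandas as pd

# Hypothetical toy salaries
salaries = pd.Series([0.0, 9000.0, 10000.0, 45000.0, 120000.0, 400000.0])

# An integer asks for that many equal-width bins across the data range
equal_width = pd.cut(salaries, 4)
print(equal_width.value_counts().sort_index())

# A list gives explicit boundaries; labels name the resulting bins
edges = [-np.inf, 10000, 50000, np.inf]
named = pd.cut(salaries, bins=edges, labels=["Low", "Medium", "High"])
print(named.tolist())
```

Using infinite outer boundaries means no value can fall outside the bins, at the cost of very wide first and last buckets.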

Instructions 1/2
<p>Bin the values of the <code>ConvertedSalary</code> column in <code>so_survey_df</code> into 5 equal-width bins, in a new column called <code>equal_binned</code>.</p>


In [25]:
# Bin the continuous variable ConvertedSalary into 5 bins
so_survey_df['equal_binned'] = pd.cut(so_survey_df['ConvertedSalary'], 5)

# Print the first 5 rows of the equal_binned column
so_survey_df[['equal_binned', 'ConvertedSalary']].head()

          equal_binned  ConvertedSalary
0  (-2000.0, 400000.0]              0.0
1  (-2000.0, 400000.0]          70841.0
2  (-2000.0, 400000.0]              0.0
3  (-2000.0, 400000.0]          21426.0
4  (-2000.0, 400000.0]          41671.0


Instructions 2/2
<p>Bin the <code>ConvertedSalary</code> column using the boundaries in the list <code>bins</code> and label the bins using <code>labels</code>.</p>

In [26]:
# Specify the boundaries of the bins
bins = [-np.inf, 10000, 50000, 100000, 150000, np.inf]

# Bin labels
labels = ['Very low', 'Low', 'Medium', 'High', 'Very high']

# Bin the continuous variable ConvertedSalary using these boundaries
so_survey_df['boundary_binned'] = pd.cut(so_survey_df['ConvertedSalary'], bins=bins, labels=labels)

# Print the first 5 rows of the boundary_binned column
so_survey_df[['boundary_binned', 'ConvertedSalary']].head()

   boundary_binned  ConvertedSalary
0         Very low              0.0
1           Medium          70841.0
2         Very low              0.0
3              Low          21426.0
4              Low          41671.0


**Now you can bin columns with equal spacing and predefined boundaries.**