### Feature Engineering
- Often, we need to derive new columns from existing ones to gain additional insight into the relationships in the data.
- A common use case is encoding string columns as numerical features.

### 1. Feature Engineering

#### 1.1 - Encoding binary variables.
- Use pandas, NumPy and scikit-learn.

In [2]:
import pandas as pd
df = pd.DataFrame(["Yes", "No", "No", "Yes"], columns=['Churn'])
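
A minimal sketch of three ways to encode the `Churn` column, one per library named above. The new column names (`churn_map`, `churn_np`, `churn_le`) are made up for illustration, and which option fits best depends on the rest of your pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame(["Yes", "No", "No", "Yes"], columns=["Churn"])

# Option 1 (pandas): explicit mapping; values missing from the dict become NaN
df["churn_map"] = df["Churn"].map({"No": 0, "Yes": 1})

# Option 2 (NumPy): vectorised condition, 1 where the value is "Yes", else 0
df["churn_np"] = np.where(df["Churn"] == "Yes", 1, 0)

# Option 3 (scikit-learn): LabelEncoder sorts the classes, so "No" -> 0 and "Yes" -> 1
encoder = LabelEncoder()
df["churn_le"] = encoder.fit_transform(df["Churn"])

print(df)
```

The explicit mapping makes the meaning of 0 and 1 visible in the code, while `LabelEncoder` keeps the fitted classes in `encoder.classes_` so the encoding can be reversed later.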

### Exercise 1.2 - Get the proportion of visitors on Sundays in the following dataset.
- What are the datatypes?
- Convert the `date` column to a datetime type.
- Create a new column `weekday` from `date` that holds the day number or the day name. Hint: datetime objects have a `weekday()` method that returns the day of the week as a number (Monday is 0).

In [3]:
# Let's create our dataframe for the exercise:
from datetime import datetime

dates = [f"2023-01-{day:02d}" for day in range(1, 32)]
visitors = [270, 470, 300, 420, 390, 360, 330, 300, 420, 490, 480, 410, 460, 480, 220, 450, 390, 430, 400, 350, 380, 290, 330, 490, 420, 470, 340, 500, 260, 420, 400]

df_visitations = pd.DataFrame({'date': dates, 'visitors': visitors})
df_visitations.head()

Unnamed: 0,date,visitors
0,2023-01-01,270
1,2023-01-02,470
2,2023-01-03,300
3,2023-01-04,420
4,2023-01-05,390


In [10]:
df_visitations['date'] = pd.to_datetime(df_visitations['date'])
df_visitations.head()





Unnamed: 0,date,visitors
0,2023-01-01,270
1,2023-01-02,470
2,2023-01-03,300
3,2023-01-04,420
4,2023-01-05,390


In [11]:
# extract the weekday and add it as a new column
df_visitations['weekday'] = df_visitations['date'].dt.day_name()
df_visitations.head()

Unnamed: 0,date,visitors,weekday
0,2023-01-01,270,Sunday
1,2023-01-02,470,Monday
2,2023-01-03,300,Tuesday
3,2023-01-04,420,Wednesday
4,2023-01-05,390,Thursday


In [5]:
df_visitations.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31 entries, 0 to 30
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   date      31 non-null     datetime64[ns]
 1   visitors  31 non-null     int64         
dtypes: datetime64[ns](1), int64(1)
memory usage: 624.0 bytes
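
Once the `weekday` column exists, the remaining step of Exercise 1.2 is the Sunday share itself. The sketch below rebuilds the same dataframe so that it runs on its own; the variable names `sunday_visitors`, `total_visitors` and `sunday_share` are illustrative.

```python
import pandas as pd

dates = [f"2023-01-{day:02d}" for day in range(1, 32)]
visitors = [270, 470, 300, 420, 390, 360, 330, 300, 420, 490, 480, 410, 460,
            480, 220, 450, 390, 430, 400, 350, 380, 290, 330, 490, 420, 470,
            340, 500, 260, 420, 400]

df_visitations = pd.DataFrame({"date": pd.to_datetime(dates), "visitors": visitors})
df_visitations["weekday"] = df_visitations["date"].dt.day_name()

# Sum the visitors on Sundays and divide by the overall total
is_sunday = df_visitations["weekday"] == "Sunday"
sunday_visitors = df_visitations.loc[is_sunday, "visitors"].sum()
total_visitors = df_visitations["visitors"].sum()
sunday_share = sunday_visitors / total_visitors

print(f"Share of visitors on Sundays: {sunday_share:.1%}")
```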


### Exercise 1.3 - Calculate the `weekly_average` of each shop and store it in a separate column.

In [13]:
df_exp = pd.DataFrame({
    'Shop Name': ['Madrid', 'Barcelona', 'Sevilla'],
    'Monday': ["100€", "200€", "150€"],
    'Tuesday': ["90€", "80€", "70€"],
    'Wednesday': ["120€", "150€", "110€"],
    'Thursday': ["100€", "90€", "120€"],
    'Friday': ["70€", "110€", "130€"]
})

df_exp

Unnamed: 0,Shop Name,Monday,Tuesday,Wednesday,Thursday,Friday
0,Madrid,100€,90€,120€,100€,70€
1,Barcelona,200€,80€,150€,90€,110€
2,Sevilla,150€,70€,110€,120€,130€


In [14]:
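# Day columns holding amounts as strings such as "100€"
columns = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
# Strip every non-digit character (the € sign) and convert to integers
df_exp[columns] = df_exp[columns].replace('[^0-9]', '', regex=True).astype(int)
# Row-wise mean across the five days
df_exp['weekly_average'] = df_exp[columns].mean(axis=1)
df_exp

Unnamed: 0,Shop Name,Monday,Tuesday,Wednesday,Thursday,Friday,weekly_average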
0,Madrid,100,90,120,100,70,96.0
1,Barcelona,200,80,150,90,110,126.0
2,Sevilla,150,70,110,120,130,116.0


### Exercise 1.4 - Feature Engineering with Regex, String Operators, NumPy.
Here is some help if you need it: [Regex w3](https://www.w3schools.com/python/python_regex.asp)
- The `distance` column mixes kilometres and miles, written as free text.
- Store the `distance_value` and `distance_unit` in separate columns. Map the units so that kilometre and mile are easy to identify (e.g. km, mi or kilometer, mile).
- Now create a new column `distance_km` that contains all distances in kilometres. FYI: 1 mile is about 1.61 kilometres.

You might need the `str.extract()`, `str.lower()`, `astype()`, `map()` and `np.where()` methods.

In [15]:
df_distances = pd.DataFrame({
    "city1": ["Madrid", "Madrid", "Madrid", "London", "London", "London"],
    "city2": ["Barcelona", "Valencia", "Sevilla", "Manchester", "Liverpool", "Birmingham"],
    "distance": ["505 kilometres", "303KM", "390.5 km", "163 miles", "220 mi.", "miles: 120"]
})

In [16]:
df_distances

Unnamed: 0,city1,city2,distance
0,Madrid,Barcelona,505 kilometres
1,Madrid,Valencia,303KM
2,Madrid,Sevilla,390.5 km
3,London,Manchester,163 miles
4,London,Liverpool,220 mi.
5,London,Birmingham,miles: 120


In [17]:
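# Named groups become the column names. The pattern expects the number before the unit,
# so "miles: 120" does not match and both columns are NaN for that row.
pattern = r'(?P<value>[\d\.]+)(?P<unit>[^\d]+)'

df_distances[['value', 'unit']] = df_distances['distance'].str.extract(pattern, expand=True)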

In [20]:
df_distances

Unnamed: 0,city1,city2,distance,value,unit
0,Madrid,Barcelona,505 kilometres,505.0,kilometres
1,Madrid,Valencia,303KM,303.0,KM
2,Madrid,Sevilla,390.5 km,390.5,km
3,London,Manchester,163 miles,163.0,miles
4,London,Liverpool,220 mi.,220.0,mi.
5,London,Birmingham,miles: 120,,
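
One possible way to cover every row, including `miles: 120`, is to extract the number and the unit with two separate patterns, so their order in the string no longer matters. This is a sketch, not the only solution: the `unit_map` dictionary and the constant `MILE_IN_KM` are names chosen here, and the mapping only covers the spellings that occur in this dataset.

```python
import numpy as np
import pandas as pd

df_distances = pd.DataFrame({
    "city1": ["Madrid", "Madrid", "Madrid", "London", "London", "London"],
    "city2": ["Barcelona", "Valencia", "Sevilla", "Manchester", "Liverpool", "Birmingham"],
    "distance": ["505 kilometres", "303KM", "390.5 km", "163 miles", "220 mi.", "miles: 120"],
})

# Lower-case once so "KM" and "km" are treated the same
text = df_distances["distance"].str.lower()

# Number: first run of digits with an optional decimal part, wherever it appears
df_distances["distance_value"] = (
    text.str.extract(r"(\d+(?:\.\d+)?)", expand=False).astype(float)
)

# Unit: first run of letters, mapped to a short canonical label
unit_map = {"kilometres": "km", "km": "km", "miles": "mi", "mi": "mi"}
df_distances["distance_unit"] = text.str.extract(r"([a-z]+)", expand=False).map(unit_map)

# Convert miles to kilometres and leave kilometre values unchanged
MILE_IN_KM = 1.61
df_distances["distance_km"] = np.where(
    df_distances["distance_unit"] == "mi",
    df_distances["distance_value"] * MILE_IN_KM,
    df_distances["distance_value"],
)

print(df_distances[["distance", "distance_value", "distance_unit", "distance_km"]])
```

Any spelling missing from `unit_map` ends up as NaN in `distance_unit`, which makes unexpected units easy to spot.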


### 2. Dimensionality Reduction
- From feature selection to feature extraction. But what is the difference?


### 2.1 - Remove all the redundant features from the `df_exp` and `df_distances` datasets. 
- What makes a feature redundant?

### 2.2 - Feature Selection - What can you say about the relationship of these columns? What columns explain each other?

- Duration: Exercise duration
- Pulse: Average pulse rate during exercise
- Maxpulse: Maximum pulse
- Calories: Calories burned during the session. 
- Kilojoules: Kilojoules burned during the session (Calories * 4.19).
- Pulse_proportion: Average pulse as a proportion of the maximum pulse (Pulse / Maxpulse).
- Weekday: Day of the week (0-6).
- Session count: Every session has a default count of 1.

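As you are all aware, many models assume that the features are independent of each other: Naive Bayes assumes conditional independence, and linear models assume there is no strong multicollinearity.
Strongly correlated features therefore carry redundant information and can make the fitted coefficients unstable and hard to interpret (not good).
Features that have a low correlation with the target, on the other hand, explain little of the target's variance.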

- What columns would you select and why?

In [6]:
df_exercise = pd.read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQRm5unwrfItPTzB8MHvrD2IbCQzyUX5nJxLOXAV8wU7vDnhuWS1SY_zmNkvuaSTkpIcsbd0_yIMN5F/pub?gid=1979654030&single=true&output=csv")
df_exercise.head()

Unnamed: 0,Duration,Pulse,Maxpulse,Calories,Kilojoules,Pulse_proportion,Weekday,Session count
0,60,110,130,409.1,1714.13,0.85,6,1
1,60,117,145,479.0,2007.01,0.81,1,1
2,60,103,135,340.0,1424.6,0.76,2,1
3,45,109,175,282.4,1183.26,0.62,1,1
4,45,117,148,406.0,1701.14,0.79,4,1


#### 2.2.1 - Show the correlation grid rounded to 2 decimal places and visualise it with a seaborn heatmap (doc: [sns.heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html))
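
A minimal sketch of the correlation grid and heatmap. To stay runnable without the download, it uses a stand-in dataframe `df_sample` built from the five rows shown above; in the exercise you would call the same methods on `df_exercise`. `Session count` is left out here because a constant column has no defined correlation. The colour map and figure size are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Stand-in for df_exercise: only the five rows displayed by head()
df_sample = pd.DataFrame({
    "Duration": [60, 60, 60, 45, 45],
    "Pulse": [110, 117, 103, 109, 117],
    "Maxpulse": [130, 145, 135, 175, 148],
    "Calories": [409.1, 479.0, 340.0, 282.4, 406.0],
    "Kilojoules": [1714.13, 2007.01, 1424.6, 1183.26, 1701.14],
    "Pulse_proportion": [0.85, 0.81, 0.76, 0.62, 0.79],
    "Weekday": [6, 1, 2, 1, 4],
})

# Pairwise Pearson correlations, rounded to 2 decimal places
corr = df_sample.corr().round(2)
print(corr)

# Heatmap with the coefficient written in each cell; fixed colour scale from -1 to 1
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation between exercise features")
fig.tight_layout()
```

Pairs with a coefficient close to 1 or -1, such as `Calories` and `Kilojoules`, are candidates for dropping one of the two columns.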

#### 2.2.2 - Select the features that explain the most variance.

### 2.3 - Feature Selection and Extraction
- Let's use the churn dataset from the previous week.
- You can find more information about the attributes here: [link to data](https://archive.ics.uci.edu/ml/datasets/Iranian+Churn+Dataset)
- Drop the columns `FN` and `FP` right away, as they are irrelevant for us.

#### 2.3.1 - Import the dataset and drop columns `FN` & `FP`. 

In [7]:
df_churn = pd.read_csv("data/customer_churn.csv")
df_churn = df_churn.drop(["FN", "FP"], axis=1)
df_churn.head()

Unnamed: 0,Call Failure,Complains,Subscription Length,Charge Amount,Seconds of Use,Frequency of use,Frequency of SMS,Distinct Called Numbers,Age Group,Tariff Plan,Status,Age,Customer Value,Churn
0,8,0,38,0,4370,71,5,17,3,1,1,30,197.64,0
1,0,0,39,0,318,5,7,4,2,1,2,25,46.035,0
2,10,0,37,0,2453,60,359,24,3,1,1,30,1536.52,0
3,10,0,38,0,4198,66,1,35,1,1,1,15,240.02,0
4,3,0,38,0,2393,58,2,33,1,1,1,15,145.805,0


#### 2.3.2 - Define target and features.

#### 2.3.3 - Analyse the correlation between the features.

#### 2.3.4 - Divide the data into training and test sets 
- Use an 80/20 train/test split (test size 0.2).
- Bear in mind that the target variable is imbalanced, so stratification is needed.
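
A sketch of steps 2.3.2 and 2.3.4 together. Because `data/customer_churn.csv` is not available here, the example builds a hypothetical imbalanced stand-in, `df_demo`, with `make_classification`; with the real data you would start from `df_churn` instead. The `random_state` value is arbitrary.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df_churn with an imbalanced binary target
X_arr, y_arr = make_classification(
    n_samples=1000, n_features=8, n_informative=4, weights=[0.84], random_state=42
)
df_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])
df_demo["Churn"] = y_arr

# Target is the Churn column, features are everything else
X = df_demo.drop(columns="Churn")
y = df_demo["Churn"]

# 80/20 split; stratify=y keeps the churn rate the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f"Churn rate in train: {y_train.mean():.3f}, in test: {y_test.mean():.3f}")
```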

#### 2.3.5 - Standardise the data.

#### 2.3.6 - Instantiate the PCA and transform the scaled features. 
- Make sure that at least 80% of the variance is preserved.
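
Steps 2.3.5 and 2.3.6 could look roughly like the sketch below. It again runs on synthetic data from `make_classification` as a stand-in for the churn features, so the printed numbers say nothing about the real dataset. Passing a float between 0 and 1 as `n_components` asks scikit-learn's `PCA` to keep the smallest number of components whose cumulative explained variance exceeds that share.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn features and target (assumption, not the real data)
X, y = make_classification(
    n_samples=1000, n_features=12, n_informative=5, n_redundant=4,
    weights=[0.84], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Keep as many components as needed to preserve at least 80% of the variance
pca = PCA(n_components=0.80)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

print("Components kept:", pca.n_components_)
print("Variance preserved:", round(pca.explained_variance_ratio_.sum(), 3))
print("Share of the first component:", round(pca.explained_variance_ratio_[0], 3))
```

Fitting the scaler and the PCA on the training set alone avoids leaking information from the test set into the transformation.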

#### 2.3.7 - How many principal components are needed to explain 80% of the variance? How much of the variance does the largest principal component explain?

#### 2.3.8 - Draw the elbow plot and explain how many components you would choose based on it.
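
One way to draw the elbow plot is to fit a PCA with all components and plot the explained variance per component next to its cumulative sum. The sketch uses synthetic data as a stand-in; in the exercise you would fit on the scaled training features. The helper variable `n_for_80` is an illustrative name.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled training features (assumption)
X, _ = make_classification(
    n_samples=1000, n_features=12, n_informative=5, n_redundant=4, random_state=42
)
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA without limiting the number of components
pca_full = PCA().fit(X_scaled)
ratios = pca_full.explained_variance_ratio_
cumulative = np.cumsum(ratios)
components = np.arange(1, len(ratios) + 1)

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(components, ratios, marker="o", label="per component")
ax.plot(components, cumulative, marker="s", label="cumulative")
ax.axhline(0.80, linestyle="--", color="grey", label="80% threshold")
ax.set_xlabel("Number of principal components")
ax.set_ylabel("Explained variance ratio")
ax.legend()
fig.tight_layout()

# Smallest number of components whose cumulative share reaches 80%
n_for_80 = int(np.argmax(cumulative >= 0.80)) + 1
print("Components needed for 80%:", n_for_80)
```

The elbow is the point where the per-component curve flattens, which may or may not coincide with the 80% threshold.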

#### 2.3.9 - Instantiate the [Random Forest Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) and fit the model.

#### 2.3.10 - Use the model trained on the PCA components to make predictions.
- Get the accuracy score and create a confusion matrix.
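
A sketch of steps 2.3.9 and 2.3.10 on synthetic stand-in data. It bundles scaling, PCA and the classifier into one scikit-learn pipeline, which is a design choice made here rather than something the exercise prescribes; doing the steps one by one works equally well. `n_estimators=200` and `random_state=42` are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data (assumption, not the real dataset)
X, y = make_classification(
    n_samples=1000, n_features=12, n_informative=5, n_redundant=4,
    weights=[0.84], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scale, reduce to the components covering 80% of the variance, then classify
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.80),
    RandomForestClassifier(n_estimators=200, random_state=42),
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class

print(f"Accuracy: {accuracy:.3f}")
print(cm)
```

With an imbalanced target, accuracy alone can look good even when few churners are caught, so the confusion matrix is worth reading row by row.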