# Introduction

> This code lets end users reshape their data so that it is ready for the tools in this toolkit.

> The idea is to turn the values in a field into a TRUE (1) or FALSE (0) indicator.

> For example:
>> We have a dataset that contains a field 'sex', which tells us whether an individual is Male or Female. As it stands, the field holds the raw value, e.g. 'Male' or 'Female'.

>> To use this dataset with the tools, the field must instead hold a 1 or 0 value that indicates whether an individual is male. We can do this by following this process:

>>> Current column 'sex' has values 'Male' or 'Female'

>>> We can create a new field 'Is Male' that will contain our 0 or 1 value (0 = is not male, 1 = is male)

>>> This new field can then be used by the toolkit and for model training
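
> The process above can be sketched on a tiny hand-made table. The three-row frame and the name `toy` are invented for illustration and are not part of the survey data used below.

```python
import pandas as pd

# Hypothetical toy data, invented for this sketch
toy = pd.DataFrame({"sex": ["Male", "Female", "Male"]})

# 1 where the raw value is 'Male', 0 otherwise
toy["Is Male"] = (toy["sex"] == "Male").astype(int)

print(toy)
```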

# Imports

In [4]:
import pandas as pd

# Import our original dataset
> For this example we will use a survey dataset about McDonald's

In [5]:
df = pd.read_csv('dataset/MCD_survey_data.csv')

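> Viewing the first 10 rows shows what our dataset looks like. The output appears to have been captured after later cells had run, which is why the 'Is Male' and 'Above 30' columns are already present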

In [52]:
df.head(10)

| | first_name | last_name | age | email | country | postal_code | favorite_color | frequency_of_visits | favorite_menu_item | employment_status | Gender | Is Male | Above 30 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Basilius | Atcheson | 37 | batcheson0@mozilla.org | Vietnam | | red | 4.4 | Big Mac | Full-time | Male | 1 | 1 |
| 1 | Elston | Egar | 51 | eegar1@tuttocitta.it | China | | blue | 0.6 | McNuggets | Part-time | Male | 1 | 1 |
| 2 | Stormi | Goldsby | 51 | sgoldsby2@ibm.com | Indonesia | | red | 5.5 | Quarter Pounder | Retired | Female | 0 | 1 |
| 3 | Allie | Eskrick | 35 | aeskrick3@prlog.org | Portugal | 7005-724 | blue | 5.2 | Quarter Pounder | Retired | Male | 1 | 1 |
| 4 | Cornela | Skeffington | 81 | cskeffington4@archive.org | China | | red | 0.9 | McNuggets | Student | Female | 0 | 1 |
| 5 | Cyndy | Hanham | 37 | chanham5@yahoo.com | China | | red | 2.5 | Quarter Pounder | Part-time | Female | 0 | 1 |
| 6 | Napoleon | Coldrick | 37 | ncoldrick6@timesonline.co.uk | Japan | 899-5241 | blue | 0.3 | Quarter Pounder | Part-time | Male | 1 | 1 |
| 7 | Delcina | Jandera | 61 | djandera7@archive.org | Sweden | 111 61 | red | 4.6 | Big Mac | Full-time | Female | 0 | 1 |
| 8 | Avram | Yerrell | 34 | ayerrell8@pen.io | Indonesia | | blue | 3.6 | Quarter Pounder | Part-time | Male | 1 | 1 |
| 9 | Tucky | Hawtry | 40 | thawtry9@sakura.ne.jp | Russia | 356003 | green | 3.2 | Big Mac | Full-time | Male | 1 | 1 |


# Reshape the dataset accordingly
> The examples below show how to reshape the dataset by setting a TRUE (1) or FALSE (0) value in a new field

> Two separate examples will be used

## Example 1 - Gender Field

> Discover the values inside our field

> We can see we have two types of data, Female and Male

In [39]:
df['Gender'].value_counts()

Female    513
Male      487
Name: Gender, dtype: int64

> Check for any empty fields and address them accordingly

In [41]:
df['Gender'].isna().sum()

0
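
> The survey data has no empty 'Gender' values, so nothing needs fixing here. If there were gaps, two common ways to address them are sketched below on invented data. The frame `toy` and the label 'Unknown' are assumptions made for this sketch, not part of the toolkit.

```python
import pandas as pd

# Hypothetical toy data with one missing entry
toy = pd.DataFrame({"Gender": ["Male", None, "Female"]})

# Count the empty values, as in the cell above
n_missing = toy["Gender"].isna().sum()

# Option A: drop the rows where the field is empty
dropped = toy.dropna(subset=["Gender"])

# Option B: keep the rows and give the gap an explicit label
labelled = toy.fillna({"Gender": "Unknown"})
```

> Which option fits depends on the data. With option B, a later 'Is Male' indicator would treat 'Unknown' as not male unless it is handled separately.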

> Based on the above, let's shape our dataset. Only two values occur in this field, Male or Female. We check which value is present in each row and assign TRUE (1) or FALSE (0) accordingly, using the `replace` function

> Our new column can be called something like "Is Male"

In [37]:
df['Is Male'] = df['Gender']
df['Is Male'] = df['Is Male'].replace('Male', 1)
df['Is Male'] = df['Is Male'].replace('Female', 0)
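
> An alternative to chained `replace` calls is `Series.map` with an explicit dictionary. This is only a sketch on invented data; the names `toy` and `mapping` are assumptions. One possible advantage is that any value missing from the dictionary becomes NaN, which makes unexpected categories easy to spot.

```python
import pandas as pd

# Hypothetical toy data, invented for this sketch
toy = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

# Explicit lookup from raw value to indicator
mapping = {"Male": 1, "Female": 0}
toy["Is Male"] = toy["Gender"].map(mapping)

# Values that were not in the mapping would appear as NaN
unmapped = toy["Is Male"].isna().sum()
```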

> We can now see that the newly created field "Is Male" contains either a 0 or 1 value to indicate if an individual is male or not

In [38]:
df['Is Male']

0      1
1      1
2      0
3      1
4      0
      ..
995    1
996    0
997    1
998    1
999    0
Name: Is Male, Length: 1000, dtype: int64

> Run a value count to confirm that the new counts match the original ones

In [42]:
df['Is Male'].value_counts()

0    513
1    487
Name: Is Male, dtype: int64
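
> The same comparison can be made in code rather than by eye. The sketch below uses invented data and cross-tabulates the raw field against the indicator, so that every 'Male' row should fall under 1 and every 'Female' row under 0.

```python
import pandas as pd

# Hypothetical toy data, invented for this sketch
toy = pd.DataFrame({"Gender": ["Male", "Female", "Female"]})
toy["Is Male"] = (toy["Gender"] == "Male").astype(int)

# Rows: raw value, columns: indicator value
check = pd.crosstab(toy["Gender"], toy["Is Male"])

# The off-diagonal cells should be zero if the mapping is consistent
consistent = check.loc["Male", 0] == 0 and check.loc["Female", 1] == 0
```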

> Based on the above example, we can see that the relevant values have been added to our dataset

## Example 2 - Age field
> Another example we can use is the age field. In this example we will split our dataset into two groups, below 30 and above 30


> First, we create a new field in the data that is derived from age.
> We apply a lambda function that returns 1 when the age meets the threshold and 0 when it does not.
> Both steps are done in the single line below

In [54]:
# Set according to your needs
new_col_name = 'Above 30'
condition = 30  # threshold: ages of 30 or more are flagged 1

df[new_col_name]=df['age'].apply(lambda x: 1 if x>=condition else 0)
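
> For a simple threshold, a direct comparison on the column can give the same result as `apply` without a lambda. The sketch below uses invented ages and mirrors the `>=` in the cell above, so an age of exactly 30 is flagged 1.

```python
import pandas as pd

# Hypothetical toy data, invented for this sketch
toy = pd.DataFrame({"age": [25, 30, 31]})
threshold = 30

# Row-by-row version, as in the cell above
via_apply = toy["age"].apply(lambda x: 1 if x >= threshold else 0)

# Column-wise version: compare first, then turn True/False into 1/0
via_compare = (toy["age"] >= threshold).astype(int)

# Both approaches should agree on every row
same = (via_apply == via_compare).all()
```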

> Conversely, we can do the same to detect individuals below 30. Flip the '>=' to '<' in the lambda function, so that an age of exactly 30 is not counted in both groups

In [55]:
new_col_name = 'Below 30'
condition = 30
df[new_col_name]=df['age'].apply(lambda x: 1 if x<condition else 0)
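
> A quick way to confirm that two such flags split the rows cleanly is to check that every row has exactly one of them set. The sketch uses invented ages and defines 'below' as strictly less than the threshold. It also shows that if both comparisons include the threshold, a row at exactly 30 is flagged twice.

```python
import pandas as pd

# Hypothetical toy data, invented for this sketch
toy = pd.DataFrame({"age": [25, 30, 31]})
threshold = 30

toy["Above 30"] = (toy["age"] >= threshold).astype(int)
toy["Below 30"] = (toy["age"] < threshold).astype(int)

# Exactly one flag should be set on each row
exclusive = ((toy["Above 30"] + toy["Below 30"]) == 1).all()

# With <= instead of <, the row at exactly 30 lands in both groups
overlapping = (toy["age"] <= threshold).astype(int)
double_flagged = int(((toy["Above 30"] + overlapping) == 2).sum())
```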

> Finally, we can place our indicator columns into a new dataframe that we can later use within the toolkit

In [57]:
new_df = df[['Is Male','Above 30', 'Below 30']]

In [58]:
new_df

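| | Is Male | Above 30 | Below 30 |
|---|---|---|---|
| 0 | 1 | 1 | 0 |
| 1 | 1 | 1 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 1 | 1 | 0 |
| 4 | 0 | 1 | 0 |
| ... | ... | ... | ... |
| 995 | 1 | 1 | 0 |
| 996 | 0 | 0 | 1 |
| 997 | 1 | 1 | 0 |
| 998 | 1 | 0 | 1 |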
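
> Before handing the indicators to other tools, it may be worth confirming that every column holds only 0 and 1. A minimal sketch on invented data, assuming the three indicator columns shown above:

```python
import pandas as pd

# Hypothetical toy data, invented for this sketch
toy = pd.DataFrame({
    "Is Male": [1, 0, 1],
    "Above 30": [1, 1, 0],
    "Below 30": [0, 0, 1],
})

# True only if every cell in every column is 0 or 1
only_binary = toy.isin([0, 1]).all().all()
```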


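> Below is an example of how to export the dataset. The call writes the full `df`, including the new indicator columns. To export only the indicators, call `to_csv` on `new_df` instead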

In [60]:
df.to_csv('dataset/mcd_toolkit_ready_data.csv', index=False)
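
> The `index=False` argument keeps the row index out of the file. The sketch below shows the difference on invented data, writing to an in-memory buffer instead of a file so that nothing is left on disk. Without `index=False`, reading the file back produces an extra 'Unnamed: 0' column.

```python
import io

import pandas as pd

# Hypothetical toy data, invented for this sketch
toy = pd.DataFrame({"Is Male": [1, 0], "Above 30": [1, 0], "Below 30": [0, 1]})

# Export without the index, then read it back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

# Export with the index (the default), then read it back
buffer_with_index = io.StringIO()
toy.to_csv(buffer_with_index)
buffer_with_index.seek(0)
with_index = pd.read_csv(buffer_with_index)

print(list(round_trip.columns))
print(list(with_index.columns))
```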
