# Data Wrangling (Data Exploration)

- Data wrangling is the process of cleaning, transforming, and structuring raw data into a usable format for analysis and modeling.

- 1 - data cleaning 
- 2 - data transforming 
- 3 - data structuring 

# Why do we need to wrangle our data?
- Real-world data is often messy, inconsistent, and incomplete.
- It is commonly estimated that more than 70% of the time in data science is spent on preparing the data.
- A well-wrangled dataset makes your models and insights more accurate.

# Step 1 - Data cleaning

#### Identify issues in our dataset before wrangling

- 1 - Missing values
- 2 - Duplicate rows
- 3 - Categorical data & numeric data
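A quick way to check for all three issues at once is a small audit helper. This is a minimal sketch: the toy DataFrame and the `audit` function name are assumptions for illustration, not part of the patients dataset or of pandas.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "age": [29, 68, 68, 41],
    "gender": ["Male", "Female", "Female", None],
    "bmi": [23.9, 31.7, 31.7, 26.1],
})

def audit(frame: pd.DataFrame) -> dict:
    """Summarise missing values, duplicate rows, and column types."""
    return {
        "missing_per_column": frame.isnull().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
        "numeric_columns": frame.select_dtypes(include="number").columns.tolist(),
        # everything that is not a number (text, booleans, dates, ...)
        "categorical_columns": frame.select_dtypes(exclude="number").columns.tolist(),
    }

report = audit(toy)
print(report)
```

Running the same helper on the real DataFrame gives a one-glance summary before any cleaning starts.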

##### step 1a - Missing Values 

In [11]:
# import library - pandas
import pandas as pd

In [12]:
# step 2: load the dataset

df = pd.read_csv("C:/Users/Muham/Downloads/EX-ROLES/AMDARI/Edwin_Mentorship/data/patients.csv")
df.head()

Unnamed: 0,patient_id,age,gender,bmi,smoker,chronic_cond,injury_type,signup_date,referral_source,consent,clinic_id,insurance_type
0,1,29,Male,23.9,False,,Knee,12:44.3,Insurance,True,6,Public
1,2,68,Female,31.7,True,,Shoulder,12:44.3,Insurance,False,3,Public
2,3,64,Male,27.5,False,,Hip,12:44.3,GP,True,5,Public
3,4,41,Female,26.1,False,,Back,12:44.3,GP,True,2,Private-Premium
4,5,22,Female,21.3,False,,Shoulder,12:44.3,Insurance,True,4,Private-Basic


In [13]:
# step 3 : identify missing values
df.isnull().sum()

patient_id            0
age                   0
gender                0
bmi                   0
smoker                0
chronic_cond       3075
injury_type           0
signup_date           0
referral_source       0
consent               0
clinic_id             0
insurance_type        0
dtype: int64

In [None]:
# We've identified the missing values,
# so next we fix them by cleaning the data.

# How do you fix them?

# 1. Drop the rows (or the column) with missing values
# 2. Fill the missing values with a specific value

# When should you drop, and when should you fill?

# step 1 - find the percentage of missing values in each column
  # if the percentage is greater than roughly 40% (anywhere in the 30-50% range) ---- consider dropping
  #   (usually the column: dropping the affected rows would discard most of the dataset)
  # if the percentage is less than roughly 40% ---- consider filling the missing values with a specific value


# step 2 - fill the missing values
  # check whether the column's data type is numeric or categorical
     # numeric data types are just numbers -- 4 columns in the dataset
     # categorical data types are text, strings, or booleans (non-numeric) -- 8 columns in the dataset

# if the column is numeric ---- fill it with the mean or the median
# if the column is categorical ---- fill it with the mode (the most frequent value)
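The rule of thumb above can be sketched as a small helper. The toy DataFrame, the `suggest_strategy` name, and the suggestion strings are assumptions for illustration; the 40% cut-off is only the rough guideline from these notes, not a pandas setting.

```python
import pandas as pd

# Hypothetical toy data: two lightly missing columns, one mostly missing
toy = pd.DataFrame({
    "bmi": [23.9, None, 27.5, 26.1, 21.3],
    "injury_type": ["Knee", "Knee", None, "Back", "Knee"],
    "chronic_cond": [None, None, None, "Diabetes", None],
})

THRESHOLD = 40  # percent; a rough guideline, not a fixed rule

def suggest_strategy(frame, threshold=THRESHOLD):
    """Return {column: suggestion} from the missing share and the dtype."""
    suggestions = {}
    missing_pct = frame.isnull().mean() * 100
    for col, pct in missing_pct.items():
        if pct == 0:
            suggestions[col] = "nothing to do"
        elif pct > threshold:
            suggestions[col] = "consider dropping"
        elif pd.api.types.is_numeric_dtype(frame[col]):
            suggestions[col] = "fill with mean or median"
        else:
            suggestions[col] = "fill with mode"
    return suggestions

plan = suggest_strategy(toy)
print(plan)

# Apply the fills suggested for the two lightly missing columns
filled = toy.copy()
filled["bmi"] = filled["bmi"].fillna(filled["bmi"].median())
filled["injury_type"] = filled["injury_type"].fillna(filled["injury_type"].mode()[0])
```

The helper only suggests an action; the final decision still depends on what a missing value means in each column.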

In [14]:
#  step 4 : percentage of missing values in each column

missing_percentage = df.isnull().mean() * 100
print(missing_percentage)

patient_id          0.00000
age                 0.00000
gender              0.00000
bmi                 0.00000
smoker              0.00000
chronic_cond       61.41402
injury_type         0.00000
signup_date         0.00000
referral_source     0.00000
consent             0.00000
clinic_id           0.00000
insurance_type      0.00000
dtype: float64


In [6]:
# option 1: drop the rows with missing values
# (stored separately so df keeps all rows for the fill step below)
df_dropped = df.dropna()

In [15]:
# check each column's data type before filling the missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5007 entries, 0 to 5006
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   patient_id       5007 non-null   int64  
 1   age              5007 non-null   int64  
 2   gender           5007 non-null   object 
 3   bmi              5007 non-null   float64
 4   smoker           5007 non-null   bool   
 5   chronic_cond     1932 non-null   object 
 6   injury_type      5007 non-null   object 
 7   signup_date      5007 non-null   object 
 8   referral_source  5007 non-null   object 
 9   consent          5007 non-null   bool   
 10  clinic_id        5007 non-null   int64  
 11  insurance_type   5007 non-null   object 
dtypes: bool(2), float64(1), int64(3), object(6)
memory usage: 401.1+ KB


In [None]:
# what is the mean? - the average of the values in that column
# what is the median? - the middle value once the column's values are sorted
# what is the mode? - the most frequent value in that column
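A small worked example makes the three statistics concrete. The values below are made up, with one extreme value included to show why the median is often preferred over the mean when a column has outliers.

```python
import pandas as pd

# Hypothetical values with one outlier (200)
values = pd.Series([22, 29, 29, 41, 64, 200])

mean_value = values.mean()      # sum of the values divided by how many there are
median_value = values.median()  # middle of the sorted values (average of the two middle ones here)
mode_value = values.mode()[0]   # most frequent value; mode() returns a Series, so take the first

print(mean_value, median_value, mode_value)
```

Here the outlier pulls the mean up to about 64.2, while the median stays at 35.0 and the mode is 29.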

In [16]:
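# step 5: fill the missing chronic conditions with the mode
# (chronic_cond is about 61% missing, above the ~40% guideline, so this mainly demonstrates fillna)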
df['chronic_cond'] = df['chronic_cond'].fillna( df['chronic_cond'].mode()[0])

In [17]:
df.head()

Unnamed: 0,patient_id,age,gender,bmi,smoker,chronic_cond,injury_type,signup_date,referral_source,consent,clinic_id,insurance_type
0,1,29,Male,23.9,False,Diabetes,Knee,12:44.3,Insurance,True,6,Public
1,2,68,Female,31.7,True,Diabetes,Shoulder,12:44.3,Insurance,False,3,Public
2,3,64,Male,27.5,False,Diabetes,Hip,12:44.3,GP,True,5,Public
3,4,41,Female,26.1,False,Diabetes,Back,12:44.3,GP,True,2,Private-Premium
4,5,22,Female,21.3,False,Diabetes,Shoulder,12:44.3,Insurance,True,4,Private-Basic


In [18]:
# step 6: count the duplicate rows
df_dup = df.duplicated().sum()

In [19]:
# drop the duplicate rows, keeping the first occurrence of each
df = df.drop_duplicates()

In [20]:
df_dup = df.duplicated().sum()
df_dup

np.int64(0)
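Both duplicated() and drop_duplicates() accept subset and keep arguments, which matter when only some columns define a duplicate. A minimal sketch on made-up data (the toy DataFrame is an assumption, not the patients dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "patient_id": [1, 2, 2, 3, 3],
    "injury_type": ["Knee", "Hip", "Hip", "Back", "Neck"],
})

# Boolean mask: True for rows that repeat an earlier row in every column
full_dupes = toy.duplicated()

# Compare patient_id only, and flag every member of a duplicate group
id_dupes = toy.duplicated(subset=["patient_id"], keep=False)

# Keep the first row per patient_id and renumber the index from 0
deduped = toy.drop_duplicates(subset=["patient_id"], keep="first").reset_index(drop=True)

print(full_dupes.tolist())
print(id_dupes.tolist())
print(deduped)
```

Resetting the index after dropping rows is optional, but it keeps row labels and row positions aligned for later loc and iloc lookups.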

# Interview Questions I

- 1 - How do you identify missing values in pandas?
- 2 - What are the strategies to handle missing values?
- 3 - When would you choose to drop vs. impute missing data?


# Answers 

- 1 - I first use isnull().sum() to get a count of the missing values in each column.
- 2 - The two main strategies are dropping the missing data and imputing (filling) it.
- 3 - If a column has more than roughly 40% missing values, consider dropping it; if it has less, impute the missing points with the mean, median, or mode.
  - If the column has numeric values, use the mean or the median.
  - If the column has categorical (non-numeric) values, impute with the mode (the most frequent value).

# Interview Questions II

- 1 - How do you remove duplicate rows in pandas?
- 2 - What's the difference between duplicated() and drop_duplicates()?

# Answers 

- 1 - To remove duplicates, first identify them and then drop them:
    - duplicated() --- flags the duplicate rows
    - drop_duplicates() --- removes them

- 2 - duplicated() returns a boolean Series indicating whether each row is a duplicate (True or False)
    - drop_duplicates() returns the DataFrame with the duplicate rows removed (by default, the first occurrence of each is kept)


### 2 -  Data Transformation


#### step 1c : Filtering & Selecting Data

- Why do you sometimes need to filter and select data?
 - You often need to focus on a specific part or subset of your data.

In [21]:
df.head()

Unnamed: 0,patient_id,age,gender,bmi,smoker,chronic_cond,injury_type,signup_date,referral_source,consent,clinic_id,insurance_type
0,1,29,Male,23.9,False,Diabetes,Knee,12:44.3,Insurance,True,6,Public
1,2,68,Female,31.7,True,Diabetes,Shoulder,12:44.3,Insurance,False,3,Public
2,3,64,Male,27.5,False,Diabetes,Hip,12:44.3,GP,True,5,Public
3,4,41,Female,26.1,False,Diabetes,Back,12:44.3,GP,True,2,Private-Premium
4,5,22,Female,21.3,False,Diabetes,Shoulder,12:44.3,Insurance,True,4,Private-Basic


In [22]:
# filter rows with a single condition

df[df['age'] == 30]

Unnamed: 0,patient_id,age,gender,bmi,smoker,chronic_cond,injury_type,signup_date,referral_source,consent,clinic_id,insurance_type
31,32,30,Female,21.5,False,Diabetes,Knee,12:44.3,Self-Referral,True,7,Private-Top-Up
210,211,30,Male,25.6,False,Diabetes,Hip,12:44.3,GP,True,6,Private-Top-Up
212,213,30,Female,26.6,False,Diabetes,Back,12:44.3,Self-Referral,True,4,Private-Basic
308,309,30,Female,28.4,False,Hypertension,Neck,12:44.3,Hospital,True,1,Public
351,352,30,Male,29.3,True,Diabetes,Back,12:44.3,GP,True,1,Public
...,...,...,...,...,...,...,...,...,...,...,...,...
4864,4865,30,Male,23.2,False,Diabetes,Back,12:44.3,Insurance,True,2,Private-Premium
4888,4889,30,Female,18.3,False,Diabetes,Hip,12:44.3,GP,True,1,Public
4915,4916,30,Male,30.6,True,Diabetes,Ankle,12:44.3,GP,True,4,Private-Premium
4942,4943,30,Male,29.2,True,Diabetes,Hip,12:44.3,Self-Referral,True,2,Private-Premium


In [6]:
# filter rows with multiple conditions
df[  (df['age'] == 30) &  (df['gender'] == 'Male') ]

Unnamed: 0,patient_id,age,gender,bmi,smoker,chronic_cond,injury_type,signup_date,referral_source,consent,clinic_id,insurance_type
210,211,30,Male,25.6,False,Diabetes,Hip,12:44.3,GP,True,6,Private-Top-Up
351,352,30,Male,29.3,True,,Back,12:44.3,GP,True,1,Public
482,483,30,Male,24.7,False,,Neck,12:44.3,GP,True,3,Public
523,524,30,Male,29.1,True,,Neck,12:44.3,Hospital,True,2,Private-Basic
594,595,30,Male,30.1,True,,Back,12:44.3,GP,True,6,Public
1137,1138,30,Male,25.1,False,,Knee,12:44.3,Insurance,True,6,Public
1285,1286,30,Male,25.3,False,,Back,12:44.3,GP,True,4,Private-Basic
1391,1392,30,Male,19.7,False,,Back,12:44.3,Hospital,True,2,Private-Basic
1493,1494,30,Male,19.7,False,,Hip,12:44.3,GP,True,3,Public
1598,1599,30,Male,20.0,False,,Neck,12:44.3,Self-Referral,True,5,Public
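Besides & (and), conditions can be combined with | (or), inverted with ~ (not), matched against a list with isin(), or written as a string with query(). A minimal sketch on made-up rows; the toy DataFrame is an assumption for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [30, 30, 45, 22, 68],
    "gender": ["Male", "Female", "Male", "Female", "Male"],
    "injury_type": ["Hip", "Knee", "Back", "Neck", "Knee"],
})

# OR: rows where either condition holds
young_or_knee = toy[(toy["age"] < 30) | (toy["injury_type"] == "Knee")]

# NOT: invert a condition with ~
not_male = toy[~(toy["gender"] == "Male")]

# isin: match any value from a list
hip_or_back = toy[toy["injury_type"].isin(["Hip", "Back"])]

# query: the same AND filter as in the cell above, written as a string expression
male_30 = toy.query("age == 30 and gender == 'Male'")
```

Each condition needs its own parentheses because & and | bind more tightly than comparison operators such as == and <.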


In [23]:
# select a specific column --- choose one column to work with

df['injury_type']

0           Knee
1       Shoulder
2            Hip
3           Back
4       Shoulder
          ...   
4995        Back
4996    Shoulder
4997        Back
4998        Knee
4999       Ankle
Name: injury_type, Length: 5000, dtype: object

In [24]:
# select multiple columns 

df[ ['age', 'gender', 'chronic_cond', 'injury_type'] ]

Unnamed: 0,age,gender,chronic_cond,injury_type
0,29,Male,Diabetes,Knee
1,68,Female,Diabetes,Shoulder
2,64,Male,Diabetes,Hip
3,41,Female,Diabetes,Back
4,22,Female,Diabetes,Shoulder
...,...,...,...,...
4995,33,Female,Diabetes,Back
4996,79,Female,Diabetes,Shoulder
4997,54,Female,Diabetes,Back
4998,84,Male,Diabetes,Knee
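The two selection styles above return different types: single brackets give a Series, while a list of names gives a DataFrame. A short sketch on made-up data shows the difference, plus select_dtypes() for choosing columns by type instead of by name.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [29, 68, 64],
    "gender": ["Male", "Female", "Male"],
    "injury_type": ["Knee", "Shoulder", "Hip"],
})

one_col_series = toy["injury_type"]     # single brackets -> Series
one_col_frame = toy[["injury_type"]]    # list of names -> DataFrame with one column

# Pick columns by dtype: everything that is not numeric
non_numeric = toy.select_dtypes(exclude="number")

print(type(one_col_series).__name__, type(one_col_frame).__name__)
print(non_numeric.columns.tolist())
```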


# Indexing the data

- iloc - position-based selection
- loc - label-based selection

In [28]:
df.head(11)

Unnamed: 0,patient_id,age,gender,bmi,smoker,chronic_cond,injury_type,signup_date,referral_source,consent,clinic_id,insurance_type
0,1,29,Male,23.9,False,Diabetes,Knee,12:44.3,Insurance,True,6,Public
1,2,68,Female,31.7,True,Diabetes,Shoulder,12:44.3,Insurance,False,3,Public
2,3,64,Male,27.5,False,Diabetes,Hip,12:44.3,GP,True,5,Public
3,4,41,Female,26.1,False,Diabetes,Back,12:44.3,GP,True,2,Private-Premium
4,5,22,Female,21.3,False,Diabetes,Shoulder,12:44.3,Insurance,True,4,Private-Basic
5,6,83,Male,21.3,False,Diabetes,Back,12:44.3,GP,True,4,Private-Premium
6,7,47,Female,19.4,False,Hypertension,Neck,12:44.3,GP,True,5,Public
7,8,77,Female,29.5,True,Hypertension,Back,12:44.3,GP,True,5,Public
8,9,63,Female,26.1,False,Diabetes,Shoulder,12:44.3,Self-Referral,True,2,Private-Basic
9,10,70,Female,27.3,False,Diabetes,Back,12:44.3,GP,True,1,Private-Top-Up


In [None]:
# iloc uses position-based indexing --- the integer row and column positions of the value (here row 1, column 7)
df.iloc[1,7]

'12:44.3'

In [None]:
# loc uses label-based indexing --- the row label and column name of the value
df.loc[8, 'referral_source']

'Self-Referral'
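With the default 0, 1, 2, ... index, loc and iloc can look interchangeable. The difference shows up once row labels stop matching row positions, which can happen after rows are dropped or filtered. A minimal sketch with a made-up DataFrame whose labels are deliberately not 0..3:

```python
import pandas as pd

toy = pd.DataFrame(
    {"age": [29, 68, 64, 41], "injury_type": ["Knee", "Shoulder", "Hip", "Back"]},
    index=[10, 20, 30, 40],  # row labels differ from positions 0..3
)

by_label = toy.loc[20, "injury_type"]   # row labelled 20, column named injury_type
by_position = toy.iloc[1, 1]            # second row, second column (the same cell)

label_slice = toy.loc[10:30, "age"]     # label slices include the end label
position_slice = toy.iloc[0:2, 0]       # position slices exclude the end position

print(by_label, by_position)
print(label_slice.tolist(), position_slice.tolist())
```

Both lookups return 'Shoulder', but the slices differ: the label slice has three values and the position slice has two.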

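#### INTERVIEW QUESTIONS

1 - What is the difference between loc and iloc?
2 - How do you filter rows based on multiple conditions?


### ANSWERS
- 1 - loc selects by label (index label and column name); iloc selects by integer position.
- 2 - Wrap each condition in parentheses and combine them with & (and) or | (or), e.g. df[(df['age'] == 30) & (df['gender'] == 'Male')].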

## Step 2 - Transforming Data - enriching your data

In [33]:
# create a new column directly from an existing one
df['bmi_percentage'] = df['bmi'] / 100

In [34]:
df.head()

Unnamed: 0,patient_id,age,gender,bmi,smoker,chronic_cond,injury_type,signup_date,referral_source,consent,clinic_id,insurance_type,bmi_percentage
0,1,29,Male,23.9,False,Diabetes,Knee,12:44.3,Insurance,True,6,Public,0.239
1,2,68,Female,31.7,True,Diabetes,Shoulder,12:44.3,Insurance,False,3,Public,0.317
2,3,64,Male,27.5,False,Diabetes,Hip,12:44.3,GP,True,5,Public,0.275
3,4,41,Female,26.1,False,Diabetes,Back,12:44.3,GP,True,2,Private-Premium,0.261
4,5,22,Female,21.3,False,Diabetes,Shoulder,12:44.3,Insurance,True,4,Private-Basic,0.213


In [None]:
# create a new column by applying a condition to an existing one

df['age_range'] = df['age'].apply(lambda x: "young" if x <=30 else "old")
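The same conditional column can also be built without a lambda. The sketch below compares apply() with numpy's where(), and uses pd.cut() for more than two bands. The toy ages, the extra column names, and the bin edges are assumptions for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"age": [22, 30, 31, 68]})

# Same rule as the lambda above: "young" if age <= 30, otherwise "old"
toy["age_range_apply"] = toy["age"].apply(lambda x: "young" if x <= 30 else "old")
toy["age_range_where"] = np.where(toy["age"] <= 30, "young", "old")

# pd.cut handles more than two bands; bins are (0, 30], (30, 60], (60, 120]
toy["age_band"] = pd.cut(
    toy["age"],
    bins=[0, 30, 60, 120],
    labels=["young", "middle", "senior"],
)

print(toy)
```

np.where and pd.cut operate on the whole column at once, which tends to be faster than apply() on large datasets.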

In [None]:
df.head(3)

In [None]:
# rename a column
df.rename(columns={"age" : "Maturity_age"}, inplace=True)
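rename() can also return a new DataFrame instead of modifying the original, and it accepts several columns at once. A minimal sketch; the toy data and the body_mass_index name are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"age": [29, 68], "bmi": [23.9, 31.7]})

# Without inplace=True, rename returns a new DataFrame and leaves toy unchanged
renamed = toy.rename(columns={"age": "Maturity_age", "bmi": "body_mass_index"})

print(toy.columns.tolist())
print(renamed.columns.tolist())
```

Keeping the original untouched is convenient when earlier cells still refer to the old column names.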

In [None]:
df.head()

### INTERVIEW QUESTION

1 - How do you create a new column based on existing data?
2 - What does apply() do in pandas?

#### Answers
1 - You can create one in two ways:
 - directly, by assigning to a new column name (e.g. df['salaryAge'])
 - by applying a condition with .apply() and a lambda function

2 - apply() calls a function (often a lambda) on each element of a Series, or on each row or column of a DataFrame.
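The two forms of apply() can be sketched side by side. The toy data, the `label_row` function, and its rule are made up for illustration only.

```python
import pandas as pd

toy = pd.DataFrame({"age": [29, 68, 22], "smoker": [False, True, True]})

# Series.apply: the function receives one value at a time
toy["age_in_months"] = toy["age"].apply(lambda years: years * 12)

# DataFrame.apply with axis=1: the function receives one row at a time
def label_row(row):
    # Hypothetical rule combining two columns
    if row["smoker"] and row["age"] > 60:
        return "senior smoker"
    return "other"

toy["group"] = toy.apply(label_row, axis=1)

print(toy)
```

Row-wise apply is useful when the new value depends on more than one column.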
