# Data Preprocessing and Transformation Assignment


In this assignment, you will practice data preprocessing and transformation techniques using a dataset from a telecom company. The dataset is stored in an SQLite database and contains two tables: customer demographics and subscription details.

Your tasks are to:

1. Read the `.db` database file.
2. Merge the two tables into one DataFrame.
3. Explore the merged DataFrame.
4. Assess and handle missing values.
5. Assess and filter out outliers.
6. Impute missing data.
7. Apply ordinal encoding to ordinal variables.
8. Apply one-hot encoding to nominal variables.
9. Merge the encoded features into the final DataFrame.
10. **For every task**, create a markdown cell that explains what you did and what the results show.

> Make sure the notebook is free of syntax errors and does not output unnecessary data. Keep it clean and neat.




## Task 1: Read the Database (`telecom_data.db`)

In [9]:
import sqlite3

conn = sqlite3.connect('telecom_data.db')
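
The connection opened above stays open for the rest of the notebook. As a minimal sketch of a more defensive pattern, the hypothetical helper `list_tables` below (not part of the assignment) wraps the connection in `contextlib.closing` so it is released even if the query fails. Note that `sqlite3.connect` silently creates an empty database file when the path does not exist, so an empty table list usually means the path is wrong.

```python
import sqlite3
from contextlib import closing

import pandas as pd


def list_tables(db_path):
    """Return the table names in an SQLite database, closing the connection afterwards."""
    # closing() guarantees conn.close() runs when the block exits.
    with closing(sqlite3.connect(db_path)) as conn:
        query = "SELECT name FROM sqlite_master WHERE type='table'"
        return pd.read_sql_query(query, conn)["name"].tolist()
```

Calling `list_tables('telecom_data.db')` would be expected to return the table names that the next task reads.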

## Task 2: Merge the Two Tables into One DataFrame called `telecom_df`

In [10]:
import pandas as pd
tables_df = pd.read_sql_query("SELECT name FROM sqlite_master WHERE type='table'", conn)
print(tables_df)

           name
0      customer
1  subscription


In [11]:
customer_df = pd.read_sql_query("SELECT * FROM customer", conn)
customer_df.head()

Unnamed: 0,customer_id,age,gender,income,region
0,1,56.0,Male,44900.0,North
1,2,69.0,Male,38000.0,North
2,3,46.0,Male,47600.0,East
3,4,32.0,Other,56100.0,South
4,5,60.0,Male,78300.0,West


In [12]:
subscription_df = pd.read_sql_query("SELECT * FROM subscription", conn)
subscription_df.head()

Unnamed: 0,customer_id,plan,monthly_charges,contract
0,1,Standard,99.97,Two year
1,2,Basic,99.7,Two year
2,3,Premium,59.99,One year
3,4,Basic,79.21,Two year
4,5,Basic,95.03,Two year


In [13]:
# customer_id is the shared column
merge_type = "inner"
telecom_df = pd.merge(customer_df, subscription_df, on="customer_id", how=merge_type)
telecom_df.head()

Unnamed: 0,customer_id,age,gender,income,region,plan,monthly_charges,contract
0,1,56.0,Male,44900.0,North,Standard,99.97,Two year
1,2,69.0,Male,38000.0,North,Basic,99.7,Two year
2,3,46.0,Male,47600.0,East,Premium,59.99,One year
3,4,32.0,Other,56100.0,South,Basic,79.21,Two year
4,5,60.0,Male,78300.0,West,Basic,95.03,Two year
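
An inner merge silently drops any `customer_id` that appears in only one table. The sketch below, on made-up toy frames rather than the real tables, shows one way to check for that before committing to the inner join: `indicator=True` adds a `_merge` column, and `validate="one_to_one"` raises an error if a key repeats on either side.

```python
import pandas as pd

# Toy frames standing in for the two tables (values are illustrative only).
left = pd.DataFrame({"customer_id": [1, 2, 3], "age": [56.0, 69.0, 46.0]})
right = pd.DataFrame({"customer_id": [2, 3, 4], "plan": ["Basic", "Premium", "Basic"]})

# how="outer" keeps unmatched keys so they can be inspected.
# validate="one_to_one" raises pandas.errors.MergeError on duplicate keys.
check = pd.merge(left, right, on="customer_id", how="outer",
                 indicator=True, validate="one_to_one")

# Rows present in only one of the two frames.
unmatched = check[check["_merge"] != "both"]
print(unmatched[["customer_id", "_merge"]])
```

If `unmatched` is empty on the real data, the inner merge loses no rows.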


## Task 3: Explore the Merged DataFrame
- Use `info()`, `describe()`, and similar methods.

In [14]:
telecom_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   customer_id      100 non-null    int64  
 1   age              90 non-null     float64
 2   gender           100 non-null    object 
 3   income           95 non-null     float64
 4   region           100 non-null    object 
 5   plan             100 non-null    object 
 6   monthly_charges  100 non-null    float64
 7   contract         100 non-null    object 
dtypes: float64(3), int64(1), object(4)
memory usage: 6.4+ KB
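
`info()` reports dtypes and non-null counts but not the distribution of values. As a hedged sketch on a small made-up frame, `describe(include="all")` and `value_counts()` give the summary statistics and category frequencies that the task description also asks for.

```python
import pandas as pd

# Illustrative sample; the column names mirror the merged DataFrame, the values are made up.
sample = pd.DataFrame({
    "age": [56.0, 69.0, None, 32.0],
    "plan": ["Standard", "Basic", "Premium", "Basic"],
})

# include="all" summarises numeric and text columns in one table;
# statistics that do not apply to a column are shown as NaN.
summary = sample.describe(include="all")

# Frequency of each category in a text column.
plan_counts = sample["plan"].value_counts()

print(summary)
print(plan_counts)
```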


## Task 4: Assess Missing Values

In [15]:
telecom_df.isnull().sum()

Unnamed: 0,0
customer_id,0
age,10
gender,0
income,5
region,0
plan,0
monthly_charges,0
contract,0


In [16]:
telecom_df[telecom_df.isnull().any(axis=1)]

Unnamed: 0,customer_id,age,gender,income,region,plan,monthly_charges,contract
5,6,,Male,52600.0,East,Basic,86.47,Month-to-month
14,15,,Male,49500.0,North,Basic,42.23,Month-to-month
20,21,,Male,,North,Premium,86.7,One year
49,50,,Female,,North,Basic,51.32,Month-to-month
54,55,,Male,78000.0,South,Basic,31.61,Month-to-month
67,68,,Other,38700.0,South,Standard,27.31,Two year
71,72,,Male,55100.0,South,Basic,25.93,Two year
73,74,46.0,Male,,East,Basic,20.86,One year
81,82,,Male,57200.0,South,Standard,66.51,Two year
82,83,65.0,Other,,South,Premium,17.36,One year
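
The task list at the top also asks for outliers to be assessed and filtered. A minimal sketch of one common approach, the 1.5 × IQR rule, is given below on made-up values. The helper name `iqr_bounds` and the multiplier `k=1.5` are assumptions for illustration, not something the assignment prescribes.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return (low, high) fences at k * IQR below Q1 and above Q3."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up charges with one obviously extreme value.
charges = pd.Series([20.0, 25.0, 30.0, 35.0, 40.0, 500.0], name="monthly_charges")

low, high = iqr_bounds(charges)
is_outlier = (charges < low) | (charges > high)

# Keep only the rows inside the fences.
filtered = charges[~is_outlier]
print(filtered.tolist())
```

`quantile` ignores NaN, and comparisons against NaN are False, so rows with missing values are not flagged as outliers and remain available for imputation.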


## Task 5: Impute Missing Data

In [17]:
# Convert age to a numeric type; values that cannot be parsed become NaN
telecom_df['age'] = pd.to_numeric(telecom_df['age'], errors='coerce')
age_mean = telecom_df['age'].mean()

# Replace the missing age values with the mean age
telecom_df['age'] = telecom_df['age'].fillna(age_mean)
# Row 5 was originally missing age
print(telecom_df.iloc[5])

customer_id                     6
age                     43.277778
gender                       Male
income                    52600.0
region                       East
plan                        Basic
monthly_charges             86.47
contract           Month-to-month
Name: 5, dtype: object


In [22]:
# Convert income to a numeric type; values that cannot be parsed become NaN
telecom_df['income'] = pd.to_numeric(telecom_df['income'], errors='coerce')
income_mean = telecom_df['income'].mean()

# Replace the missing income values with the mean income
telecom_df['income'] = telecom_df['income'].fillna(income_mean)
# Row 20 was originally missing both age and income
print(telecom_df.iloc[20])

customer_id                  21
age                   43.277778
gender                     Male
income             50407.368421
region                    North
plan                    Premium
monthly_charges            86.7
contract               One year
Name: 20, dtype: object
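
The two cells above repeat the same steps for `age` and `income`. As a sketch, the hypothetical helper `impute_numeric` below applies them to any list of columns and can use the median instead of the mean, which is less sensitive to extreme values. The toy values are made up.

```python
import pandas as pd


def impute_numeric(df, columns, strategy="mean"):
    """Return a copy of df with NaNs in `columns` filled by each column's mean or median."""
    out = df.copy()
    for col in columns:
        # Coerce to numeric first so stray text becomes NaN and gets imputed too.
        out[col] = pd.to_numeric(out[col], errors="coerce")
        fill = out[col].median() if strategy == "median" else out[col].mean()
        out[col] = out[col].fillna(fill)
    return out


toy = pd.DataFrame({"age": [30.0, None, 50.0], "income": [40000.0, 60000.0, None]})
imputed = impute_numeric(toy, ["age", "income"])
print(imputed)
```

Because the helper works on a copy, the original frame keeps its missing values and the two can be compared afterwards.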


## Task 6: Apply Ordinal Encoding to Ordinal Variables

In [23]:
# Ordinal Variables: plan and contract
telecom_df['plan'].unique()

array(['Standard', 'Basic', 'Premium'], dtype=object)

In [24]:
telecom_df['contract'].unique()

array(['Two year', 'One year', 'Month-to-month'], dtype=object)

In [25]:
from sklearn.preprocessing import LabelEncoder
# Note: LabelEncoder assigns integer codes in alphabetical order of the labels, not by rank
label_encoder = LabelEncoder()  # instantiate the encoder

In [26]:
telecom_df['plan_label_encoded'] = label_encoder.fit_transform(telecom_df['plan'])
telecom_df['contract_label_encoded'] = label_encoder.fit_transform(telecom_df['contract'])

telecom_df.head()

Unnamed: 0,customer_id,age,gender,income,region,plan,monthly_charges,contract,plan_label_encoded,contract_label_encoded
0,1,56.0,Male,44900.0,North,Standard,99.97,Two year,2,2
1,2,69.0,Male,38000.0,North,Basic,99.7,Two year,0,2
2,3,46.0,Male,47600.0,East,Premium,59.99,One year,1,1
3,4,32.0,Other,56100.0,South,Basic,79.21,Two year,0,2
4,5,60.0,Male,78300.0,West,Basic,95.03,Two year,0,2
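
In the output above, `Premium` is coded 1 and `Standard` is coded 2, because `LabelEncoder` sorts labels alphabetically. If the intended ranking is Basic < Standard < Premium (an assumption about the plans, not something stated in the data), the order has to be supplied explicitly. One minimal sketch uses an ordered pandas categorical:

```python
import pandas as pd

# Assumed rankings, lowest to highest; adjust if the business meaning differs.
plan_order = ["Basic", "Standard", "Premium"]
contract_order = ["Month-to-month", "One year", "Two year"]

plans = pd.Series(["Standard", "Basic", "Premium", "Basic"])
contracts = pd.Series(["Two year", "Two year", "One year", "Month-to-month"])

# .codes gives each value's position in the supplied category list.
# Values missing from the list are coded -1, which makes typos easy to spot.
plan_codes = pd.Categorical(plans, categories=plan_order, ordered=True).codes
contract_codes = pd.Categorical(contracts, categories=contract_order, ordered=True).codes

print(list(plan_codes))      # [1, 0, 2, 0]
print(list(contract_codes))  # [2, 2, 1, 0]
```

For `contract`, alphabetical order happens to match the ranking by contract length, so only the `plan` codes would change.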


## Task 7: Apply One-Hot Encoding to Nominal Variables

In [27]:
# Nominal variables: gender and region
telecom_df['gender'].unique()

array(['Male', 'Other', 'Female'], dtype=object)

In [28]:
telecom_df['region'].unique()

array(['North', 'East', 'South', 'West'], dtype=object)

In [30]:
pd.get_dummies(telecom_df['gender']).astype(int)

Unnamed: 0,Female,Male,Other
0,0,1,0
1,0,1,0
2,0,1,0
3,0,0,1
4,0,1,0
...,...,...,...
95,0,0,1
96,0,1,0
97,1,0,0
98,0,1,0


In [31]:
pd.get_dummies(telecom_df['region']).astype(int)

Unnamed: 0,East,North,South,West
0,0,1,0,0
1,0,1,0,0
2,1,0,0,0
3,0,0,1,0
4,0,0,0,1
...,...,...,...,...
95,0,0,1,0
96,1,0,0,0
97,1,0,0,0
98,0,1,0,0
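
The two cells above display the dummy columns but do not store them. As a sketch on a made-up frame, the dummies can be given a `prefix` so their origin stays clear, then joined back with `pd.concat` in place of the original text column.

```python
import pandas as pd

# Illustrative frame; the real DataFrame has more columns.
toy = pd.DataFrame({"customer_id": [1, 2, 3], "gender": ["Male", "Other", "Female"]})

# prefix="gender" yields gender_Female, gender_Male, gender_Other; dtype=int gives 0/1.
gender_dummies = pd.get_dummies(toy["gender"], prefix="gender", dtype=int)

# Drop the text column and append the indicator columns side by side.
encoded = pd.concat([toy.drop(columns="gender"), gender_dummies], axis=1)
print(encoded)
```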


## Task 8: Merge Encoded Features into the Final DataFrame called `final_telecom_df`

In [39]:
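# One-hot encode gender and region in place, then drop the original plan and contract text columns
final_telecom_df = pd.get_dummies(telecom_df, columns=['gender', 'region'], dtype=int)
final_telecom_df = final_telecom_df.drop(['contract', 'plan'], axis=1)
final_telecom_df.head()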

Unnamed: 0,customer_id,age,income,monthly_charges,plan_label_encoded,contract_label_encoded,gender_Female,gender_Male,gender_Other,region_East,region_North,region_South,region_West
0,1,56.0,44900.0,99.97,2,2,0,1,0,0,1,0,0
1,2,69.0,38000.0,99.7,0,2,0,1,0,0,1,0,0
2,3,46.0,47600.0,59.99,1,1,0,1,0,1,0,0,0
3,4,32.0,56100.0,79.21,0,2,0,0,1,0,0,1,0
4,5,60.0,78300.0,95.03,0,2,0,1,0,0,0,0,1


