### Feature explanations

|Feature| Explanation|
|:-:|:---|
|ContactId| Unique identifier for each contact in the database.
|FirstName| The first name of the contact.
|LastName| The last name of the contact.
|FullName| The complete name of the contact, often a combination of first name and last name. 
|DateOfBirth| The contact's date of birth (stored as M/D/YYYY in the raw file, e.g., 5/8/1986).
|Gender| The gender of the contact (e.g., Male, Female, Other).
|EMail| The contact’s email address.
|Telephone| The contact’s telephone number, including country and area codes.
|PostCode| The postal code corresponding to the contact’s address.
|StreetAddress| The street address where the contact resides.
|City| The city where the contact resides.
|State| The state or province where the contact resides.
|Country| The country where the contact resides.
|CreatedOn| The date when the contact was added to the system (stored as M/D/YYYY in the raw file, e.g., 3/14/2017).
|Headshot| A link or file reference to the contact’s headshot (image file).
|Loyalty Tier| The loyalty tier assigned to the contact (e.g., high, medium), indicating their customer status or engagement level with the company.
|Email Subscriber| Indicates whether the contact has subscribed to receive marketing emails (Yes/No).
|Income| The estimated or reported income of the contact, usually represented annually.
|Occupation| The contact's current occupation or job title.
|CustomerSatisfaction| The contact's satisfaction level: high, medium, or low.



### Data Load

In [1]:
import pandas as pd

base_df = pd.read_csv("https://filestransfer.blob.core.windows.net/ciad/Contact.txt", delimiter=',')

base_df.head(100)

,ContactId,FirstName,LastName,FullName,DateOfBirth,Gender,EMail,Telephone,PostCode,StreetAddress,City,State,Country,CreatedOn,Headshot,Loyalty Tier,Email Subscriber,Income,Occupation,CustomerSatisfaction
0,CNTID_1000,Abbie,Moss,Abbie Moss,5/8/1986,Female,abbie_moss@collinsreedandhoward.com,983.566.0706x9509,10753,129 Miller Plaza,Fairfield,California,USA,3/14/2017,https://filestransfer.blob.core.windows.net/de...,high,Yes,256414.112709,Software Engineer,high
1,CNTID_1001,Kenneth,Beraun,Beraun Kenneth,8/1/1974,Male,kenneth_beraun@kimboyle.com,384.995.7852,40482,9720 William Prairie,Amarillo,Texas,USA,12/23/2018,https://filestransfer.blob.core.windows.net/de...,medium,Yes,46732.467292,Teacher,high
2,CNTID_1002,Anthony,Koteles,Acthony Koteles,8/28/1975,Male,anthony_koteles@crawfordsimmonsandgreene.com,569-626-5660,28679,3958 Perez Centers Suite 216,Inglewood,California,USA,1/14/2019,https://filestransfer.blob.core.windows.net/de...,medium,Yes,20000.000000,Teacher,high
3,CNTID_1003,Michael,Lauser,Michael Lauser,9/3/2006,Male,michael_lauser@smithinc.com,001-811-506-2553x442,93991,15091 Haynes Neck,Nashville,Tennessee,USA,1/17/2019,https://filestransfer.blob.core.windows.net/de...,medium,No,61917.706144,Teacher,medium
4,CNTID_1004,Richard,Nakada,Nakada Richard,7/30/1997,Male,richard_nakada@jonesholmesandmooney.com,857-147-6531,78389,3501 Thornton Radial,West Covina,California,USA,1/20/2019,https://filestransfer.blob.core.windows.net/de...,medium,No,20000.000000,Teacher,medium
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,CNTID_1095,Edward,Bagdasarian,Edward Baadasarian,7/4/1980,Male,edward_bagdasarian@davisfoxandjohnson.com,(536)369-8888x3479,19161,561 Jonathan Inlet Apt. 481,Newark,New Jersey,USA,1/31/2019,https://filestransfer.blob.core.windows.net/de...,high,Yes,72525.181509,Software Engineer,high
96,CNTID_1096,George,Rubero,Rubero George,11/16/1997,Male,george_rubero@browncampbellandwarner.com,001-279-768-5125,82715,72407 James Forges Suite 289,Temecula,California,USA,1/31/2019,https://filestransfer.blob.core.windows.net/de...,medium,No,20000.000000,Software Engineer,low
97,CNTID_1097,Brian,Batson,Brian Batson,4/17/2008,Male,brian_batson@jamesdudleyandgarcia.com,120-770-2761x538,31903,046 Mclaughlin Street,Ontario,California,USA,1/31/2019,https://filestransfer.blob.core.windows.net/de...,medium,No,124611.128425,Teacher,medium
98,CNTID_1098,Joseph,Borell,Joseph Borell,7/27/2007,Male,joseph_borell@mendozaandsons.com,328-693-3726x1665,10339,395 Justin Knolls Apt. 890,Fayetteville,North Carolina,USA,1/31/2019,https://filestransfer.blob.core.windows.net/de...,medium,No,20000.000000,Teacher,medium


In [2]:
print(base_df.columns)

Index(['ContactId', 'FirstName', 'LastName', 'FullName', 'DateOfBirth',
       'Gender', 'EMail', 'Telephone', 'PostCode', 'StreetAddress', 'City',
       'State', 'Country', 'CreatedOn', 'Headshot', 'Loyalty Tier',
       'Email Subscriber', 'Income', 'Occupation', 'CustomerSatisfaction'],
      dtype='object')
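
Before going further, it can help to confirm that the loaded file has the columns the analysis expects. The following is a minimal sketch on a toy frame, not part of the original notebook; `check_columns` is a hypothetical helper introduced only for illustration, and the expected names are copied from the printed `Index` above.

```python
import pandas as pd

# Column names copied from the printed Index above.
EXPECTED_COLUMNS = [
    "ContactId", "FirstName", "LastName", "FullName", "DateOfBirth",
    "Gender", "EMail", "Telephone", "PostCode", "StreetAddress", "City",
    "State", "Country", "CreatedOn", "Headshot", "Loyalty Tier",
    "Email Subscriber", "Income", "Occupation", "CustomerSatisfaction",
]


def check_columns(df, expected):
    """Hypothetical helper: return (missing, unexpected) column names."""
    missing = [c for c in expected if c not in df.columns]
    unexpected = [c for c in df.columns if c not in expected]
    return missing, unexpected


# Toy frame that deliberately lacks the last expected column.
toy = pd.DataFrame(columns=EXPECTED_COLUMNS[:-1])
missing, unexpected = check_columns(toy, EXPECTED_COLUMNS)
print("Missing:", missing)
print("Unexpected:", unexpected)
```

On the real data the same call would be `check_columns(base_df, EXPECTED_COLUMNS)`.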


### Exploratory Data Analysis

In [3]:
# Volume of the Dataset
num_records = len(base_df)
num_features = len(base_df.columns)

print(f"The dataset has {num_records} records and {num_features} features.")

The dataset has 5000 records and 20 features.


In [4]:
# Basic data exploration
print("Shape of the dataset:", base_df.shape)

Shape of the dataset: (5000, 20)


In [5]:
print("Data types:", base_df.dtypes)

Data types: ContactId                object
FirstName                object
LastName                 object
FullName                 object
DateOfBirth              object
Gender                   object
EMail                    object
Telephone                object
PostCode                 object
StreetAddress            object
City                     object
State                    object
Country                  object
CreatedOn                object
Headshot                 object
Loyalty Tier             object
Email Subscriber         object
Income                  float64
Occupation               object
CustomerSatisfaction     object
dtype: object
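
The dtypes show that `DateOfBirth` and `CreatedOn` are loaded as plain strings (`object`). If date arithmetic is needed later, they could be parsed explicitly. This is a sketch on a toy frame whose values are copied from the preview; it assumes the month/day/year layout seen there holds for the whole file, which the notebook does not verify.

```python
import pandas as pd

# Toy frame mimicking the two date columns; values taken from the preview.
sample = pd.DataFrame({
    "DateOfBirth": ["5/8/1986", "8/1/1974", None],
    "CreatedOn": ["3/14/2017", "12/23/2018", "1/14/2019"],
})

# Assumption: dates are month/day/year, as the preview suggests.
# errors="coerce" turns missing or unparseable entries into NaT
# instead of raising an exception.
for col in ["DateOfBirth", "CreatedOn"]:
    sample[col] = pd.to_datetime(sample[col], format="%m/%d/%Y", errors="coerce")

print(sample.dtypes)
print(sample)
```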


In [6]:
print("Missing values in each column:", base_df.isnull().sum())

Missing values in each column: ContactId               0
FirstName               0
LastName                0
FullName                0
DateOfBirth             1
Gender                  0
EMail                   0
Telephone               0
PostCode                0
StreetAddress           0
City                    0
State                   0
Country                 0
CreatedOn               0
Headshot                0
Loyalty Tier            0
Email Subscriber        0
Income                  0
Occupation              0
CustomerSatisfaction    0
dtype: int64
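
With 20 columns, a per-column listing is easier to read when it shows only the columns that actually have gaps, along with their share of the rows. A minimal sketch on a toy frame (the values are illustrative, not the real dataset):

```python
import pandas as pd

# Toy frame with a single missing DateOfBirth, mirroring the situation above.
toy = pd.DataFrame({
    "ContactId": ["CNTID_1000", "CNTID_1001", "CNTID_1002", "CNTID_1003"],
    "DateOfBirth": ["5/8/1986", None, "8/28/1975", "9/3/2006"],
})

# Count and percentage of missing values per column.
summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "missing_pct": toy.isnull().mean() * 100,
})

# Keep only the columns that have at least one missing value.
summary = summary[summary["missing"] > 0]
print(summary)
```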


In [7]:
base_df.describe()

,Income
count,5000.0
mean,82749.885539
std,56499.410249
min,20000.0
25%,38197.144745
50%,73990.397082
75%,111254.694585
max,419076.115889
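
The minimum of exactly 20000.0 also appears in several preview rows. Whether it is a floor, a default, or a placeholder is not stated in the source, so the sketch below only measures how common the minimum value is. It uses the five `Income` values from the first preview rows as a toy series.

```python
import pandas as pd

# Income values copied from preview rows 0-4.
incomes = pd.Series(
    [256414.112709, 46732.467292, 20000.0, 61917.706144, 20000.0],
    name="Income",
)

# Count how many records sit exactly at the minimum value.
floor_value = incomes.min()
at_floor = int((incomes == floor_value).sum())
share = at_floor / len(incomes)

print(f"{at_floor} of {len(incomes)} records ({share:.0%}) equal the minimum {floor_value}")
```

On the real data, `base_df["Income"]` would take the place of `incomes`.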


### Missing Values
Only one value is missing across the whole dataset (in `DateOfBirth`), so we drop the single affected row instead of imputing it.

In [8]:
# Drop the rows where DateOfBirth is missing
df_cleaned = base_df.dropna(subset=[
    "DateOfBirth"
])
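
A quick way to confirm that `dropna(subset=...)` removed only the intended rows is to compare row counts before and after. This sketch uses a toy frame rather than the real data:

```python
import pandas as pd

# Toy frame with one missing DateOfBirth.
toy = pd.DataFrame({
    "ContactId": ["CNTID_1000", "CNTID_1001", "CNTID_1002"],
    "DateOfBirth": ["5/8/1986", None, "8/28/1975"],
})

# Only rows with a missing value in the listed columns are removed.
cleaned = toy.dropna(subset=["DateOfBirth"])
rows_removed = len(toy) - len(cleaned)

print(f"Rows before: {len(toy)}, after: {len(cleaned)}, removed: {rows_removed}")
```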

#### Removing Unnecessary Columns

We remove columns that do not contribute to the analysis or predictive models, keeping only `ContactId`, `FirstName`, `LastName`, and `FullName`. Eliminating the other features streamlines the dataset, reduces noise, and makes the data processing pipeline more efficient.


In [None]:
# Columns to drop: everything except ContactId, FirstName, LastName and FullName
columns_to_drop = [
    'DateOfBirth', 'Gender', 'EMail', 'Telephone', 'PostCode',
    'StreetAddress', 'City', 'State', 'Country', 'CreatedOn',
    'Headshot', 'Loyalty Tier', 'Email Subscriber', 'Income',
    'Occupation', 'CustomerSatisfaction',
]

# Drop the columns from the DataFrame
df_cleaned = df_cleaned.drop(columns=columns_to_drop)

df_cleaned.head()
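
Because only four columns survive, an equivalent and shorter alternative is to list the columns to keep rather than the ones to drop. The sketch below is an assumption about how this could be written, not the notebook's own method. It uses a toy frame with a subset of the columns to show that both approaches give the same result.

```python
import pandas as pd

# Toy frame with the four name/identifier columns plus two others.
toy = pd.DataFrame({
    "ContactId": ["CNTID_1000"],
    "FirstName": ["Abbie"],
    "LastName": ["Moss"],
    "FullName": ["Abbie Moss"],
    "Gender": ["Female"],
    "Income": [256414.112709],
})

# Keep-list approach: select the wanted columns directly.
columns_to_keep = ["ContactId", "FirstName", "LastName", "FullName"]
kept = toy[columns_to_keep].copy()

# Drop-list approach, as in the cell above (restricted to the toy columns).
dropped = toy.drop(columns=["Gender", "Income"])

print(list(kept.columns))
print(kept.equals(dropped))
```

A keep-list fails loudly with a `KeyError` if an expected column is absent, and it silently excludes any new column added to the source file later. A drop-list would let such a column through.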