In data science and machine learning, one of the most important tasks when working with large amounts of data is data cleaning. In this assignment, you'll see how to clean a dataset step by step using pandas. The dataset simulates a customer list.

**Q1 What is machine learning? Where and why would you use machine learning?**

Ans: Machine learning, a branch of artificial intelligence (AI) and computer science, mimics human learning by utilizing data and algorithms to enhance accuracy over time. AI encompasses the imitation of intelligent human behavior in complex tasks. Machine learning is an AI method that instructs computers to learn from experience, adapting and improving based on available data.

Machine learning is a versatile technology with numerous applications across various domains. It's particularly valuable in situations where traditional rule-based programming is not practical or would be less effective. Machine learning's ability to learn from data and adapt over time makes it a powerful tool for solving a wide range of problems. Below are some common areas where machine learning is used and the reasons why it's applied:

1. Natural Language Processing (NLP): Machine learning techniques enable language translation, chatbots, and sentiment analysis in applications like customer support and content generation.
2. Healthcare: Machine learning can be used for disease diagnosis, predicting patient outcomes, drug discovery, and personalized treatment plans, improving the efficiency and accuracy of healthcare services.
3. Finance: In finance, machine learning is used for fraud detection, stock market prediction, credit risk assessment, and algorithmic trading to make data-driven decisions and reduce financial risks.
4. E-commerce: Recommendation systems powered by machine learning help platforms like Amazon and Netflix suggest products or content to users, leading to increased user engagement and sales.
5. Energy and Utilities: Machine learning is applied in energy consumption prediction, grid management, and renewable energy resource optimization to enhance sustainability and efficiency.

**Q2 What is normalization/scaling in machine learning and why do you perform it? Explain with examples.**

Ans: Normalization (or scaling) is a preprocessing technique in machine learning that adjusts the scale of features (variables or columns) in a dataset. Many machine learning algorithms, especially those based on gradient descent or on distances between data points, are sensitive to feature scale, so rescaling keeps a feature from dominating the model simply because its values are numerically larger. It is typically done by rescaling data to a common range, often between 0 and 1 or between -1 and 1.

Example 1:
Consider a linear regression model whose input features include a person's income (in dollars) and age (in years). The income feature might have a large scale, ranging from thousands to millions, while the age feature typically ranges from the early teens to around 100. If you use gradient descent to optimize the model's parameters, it will have difficulty balancing the influence of these two features, because a single step size has to serve two weights whose gradients differ by orders of magnitude.

Normalizing the features by applying a method like Min-Max scaling (scaling them to a common range, often between 0 and 1) or z-score normalization (scaling to have a mean of 0 and standard deviation of 1) ensures that both income and age are on a similar scale. As a result, the optimization process is more likely to converge efficiently and reach the optimal solution for the linear regression problem.
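The two methods named above can be sketched with plain pandas arithmetic. The ages and incomes below are made-up values chosen only to show the difference in scale; they are not part of the assignment dataset.

```python
import pandas as pd

# Hypothetical values, chosen only to show the difference in scale.
df = pd.DataFrame({
    "age": [18, 25, 40, 60],
    "income": [20_000, 50_000, 120_000, 1_000_000],
})

# Min-Max scaling: (x - min) / (max - min) maps each column onto [0, 1].
min_max = (df - df.min()) / (df.max() - df.min())

# Z-score normalization: (x - mean) / std gives each column mean 0 and std 1.
# ddof=0 selects the population standard deviation.
z_score = (df - df.mean()) / df.std(ddof=0)

print(min_max)
print(z_score)
```

After either transformation, `age` and `income` occupy comparable numeric ranges, which is the property the paragraph above relies on.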

Example 2:
Let's consider a different scenario in the context of a machine learning classification problem. Imagine you are building a model to predict whether a loan applicant is likely to default on their loan based on two features: "credit score" and "loan amount." These features have vastly different scales, which can affect the training of machine learning models, particularly when using algorithms like logistic regression or support vector machines.

To address this challenge, normalizing the features is essential. By applying Min-Max scaling or z-score normalization, you ensure that both "credit score" and "loan amount" are on a similar scale. This normalization process enables the model to consider both features equally when making predictions about loan defaults. It enhances the model's ability to balance the influence of creditworthiness and loan amount, leading to more accurate predictions regarding loan default probabilities. 
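One simple way to see the scale problem is to compare distances between applicants before and after scaling. The sketch below does not train a model. The three applicants and the assumed feature ranges (300 to 850 for credit score, 1,000 to 50,000 for loan amount) are hypothetical.

```python
import math

# Hypothetical applicants: (credit score, loan amount in dollars).
a = (600, 10_000)
b = (800, 10_500)   # very different credit score, similar loan amount
c = (605, 15_000)   # similar credit score, different loan amount

def min_max(value, low, high):
    """Rescale value from [low, high] onto [0, 1]."""
    return (value - low) / (high - low)

def scale(applicant):
    # Assumed feature ranges, for illustration only.
    score, amount = applicant
    return (min_max(score, 300, 850), min_max(amount, 1_000, 50_000))

# Raw distances: loan amount dominates because its numbers are larger.
raw_ab = math.dist(a, b)
raw_ac = math.dist(a, c)

# Scaled distances: both features contribute on comparable terms.
scaled_ab = math.dist(scale(a), scale(b))
scaled_ac = math.dist(scale(a), scale(c))

print(raw_ab < raw_ac)        # True: unscaled, b looks closer to a than c does
print(scaled_ab < scaled_ac)  # False: scaled, the 200-point score gap counts
```

On the raw numbers, a 200-point difference in credit score is almost invisible next to a few thousand dollars of loan amount. After scaling, both differences are measured relative to their own range.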

**Q3 What is supervised and unsupervised learning? Give some examples** 

Ans:
SUPERVISED LEARNING: Supervised learning is a type of machine learning where an algorithm learns from labeled training data to make predictions or decisions without human intervention. In this learning paradigm, the algorithm is provided with a dataset in which both the input data (features) and the corresponding correct output (labels or target) are known. The goal of supervised learning is to learn a mapping from inputs to outputs, allowing the algorithm to make accurate predictions or classifications on new, unseen data.

Example 1: Handwritten Digit Recognition: In this application, the algorithm learns to recognize handwritten digits, such as those on checks or postal codes, by training on a dataset of handwritten digits labeled with their corresponding values.

Example 2: Language Translation: In machine translation, models are trained to convert text from one language to another, using bilingual text corpora as labeled data for training.

Example 3: Fraud Detection: Banks and credit card companies use supervised learning models to detect fraudulent transactions by learning from past transaction data, labeling legitimate and fraudulent transactions.

Example 4: Credit Scoring: Financial institutions use supervised learning to predict a borrower's creditworthiness by learning from historical data, including a borrower's financial history and whether they defaulted on previous loans.
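The idea of learning a mapping from labeled inputs to outputs can be sketched with a 1-nearest-neighbor classifier, one of the simplest supervised methods. The training data, feature meanings, and the `predict` helper are hypothetical and exist only for illustration.

```python
import math

# Hypothetical labeled training data: (feature vector, label).
# Features are (hours studied, hours slept); labels are the known outcomes.
training_data = [
    ((1.0, 4.0), "fail"),
    ((2.0, 5.0), "fail"),
    ((6.0, 7.0), "pass"),
    ((8.0, 8.0), "pass"),
]

def predict(features):
    """Return the label of the closest training example (1-nearest neighbor)."""
    nearest = min(training_data, key=lambda pair: math.dist(pair[0], features))
    return nearest[1]

# New, unseen inputs are labeled according to the most similar known example.
print(predict((7.5, 8.0)))  # pass
print(predict((1.2, 4.1)))  # fail
```

The labels in `training_data` are what make this supervised: without them, the algorithm would have nothing to predict.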

UNSUPERVISED LEARNING: Unsupervised learning is a type of machine learning where the algorithm learns patterns and structures in data without the guidance of labeled outputs or target values. In unsupervised learning, the algorithm explores the inherent structure within the data to discover relationships, clusters, or other insights.

Example 1: Recommendation Systems: Companies like Netflix and Amazon use unsupervised learning to recommend products or movies to users based on their past behavior and preferences. The system groups users with similar interests and recommends items that others in the same group have liked.

Example 2: Grocery Shopping (an everyday analogy): When you go grocery shopping, you may unconsciously group similar items together in your cart or basket, such as vegetables, dairy, and canned goods. Nobody labels the groups for you in advance; you form them from the similarity of the items, which is what a clustering algorithm does with data.

Example 3: Choosing a Restaurant (an everyday analogy): When deciding where to eat, you might consider factors like cuisine, location, and price, and end up grouping restaurants with similar attributes. A clustering algorithm would group them in the same way, using those attributes as features.
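Grouping without labels can be sketched with a small one-dimensional k-means loop. The spending figures, the starting centroids, and the `k_means_1d` helper are assumptions made for this illustration.

```python
# Hypothetical unlabeled data: monthly spend of eight customers.
spend = [12.0, 15.0, 14.0, 11.0, 95.0, 102.0, 99.0, 97.0]

def k_means_1d(values, centroids, iterations=10):
    """Cluster 1-D values around the given starting centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

centroids, clusters = k_means_1d(spend, centroids=[0.0, 50.0])
print(centroids)  # [13.0, 98.25]
print(clusters)   # low spenders and high spenders, found without any labels
```

No customer was tagged as a "low" or "high" spender beforehand; the two groups emerge from the values themselves.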

**Q4 What is Data Cleaning and why do we need it?**  

Ans: Data cleaning, also known as data cleansing or data scrubbing, is the process of identifying and correcting or removing errors, inconsistencies, and inaccuracies in datasets. It is an essential step in data preprocessing and data preparation for analysis, machine learning, or any data-driven task.
Below are some reasons why it is needed:

1. Ensuring Data Accuracy: Data collected from different sources or through various methods can contain errors, inconsistencies, and inaccuracies. Cleaning the data helps ensure that it is accurate and free from errors, which is essential for making sound decisions and drawing reliable conclusions.
2. Improving Data Quality: Data cleaning enhances the overall quality of the dataset by identifying and rectifying issues such as missing values, duplicates, and outliers. This, in turn, leads to better, more reliable results in data analysis or machine learning.
3. Enhancing Data Consistency: Consistency in data is crucial for meaningful analysis. Data cleaning helps in standardizing data, such as ensuring consistent date formats, unit measurements, or category labels. Inconsistencies can lead to confusion and incorrect interpretations.
4. Supporting Reliable Analysis: Clean data is essential for accurate and dependable data analysis. Without proper data cleaning, errors or inconsistencies in the data can lead to incorrect conclusions or misleading insights.
5. Facilitating Data Integration: When working with multiple datasets, data cleaning helps align and standardize data, making it easier to integrate and analyze across different sources. It ensures that data from various origins can be used effectively together.
6. Reducing Bias: In some cases, data may be biased or contain sampling errors. Data cleaning can help mitigate these issues, making the data more representative of the population it aims to describe.
7. Optimizing Decision-Making: Accurate and reliable data is essential for making informed and confident decisions. Whether it's in business, healthcare, research, or any other domain, data cleaning ensures the data's trustworthiness and suitability for decision-making.
8. Improving Model Performance: In the context of machine learning, the performance of models is heavily dependent on the quality of the training data. Clean data helps machine learning models make better predictions and classifications.
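Before cleaning, it usually helps to measure how much of the data is affected. A minimal sketch, using a small hypothetical frame rather than the assignment dataset, counts missing values and duplicate rows:

```python
import pandas as pd

# Hypothetical records with a missing value and a repeated row.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "city": ["Paris", "Rome", "Rome", None],
})

# Count of missing values in each column.
missing_per_column = df.isna().sum()

# Number of rows that are identical to an earlier row.
duplicate_rows = int(df.duplicated().sum())

print(missing_per_column)
print(duplicate_rows)  # 1
```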

In [1]:
import pandas as pd

In [2]:
df_users = pd.DataFrame({
    "user_id": [234, 235, 236, 237, 237, 238, 239, 240, 241, 242, 242],
    "Name": ["Tom", "Alex--", "..Thomas", "John", "John", "Paul/", "Emma9", "Joy", "Samantha_", "Emily", "Emily"],
    "Last_name": ["Smith", "johnson", "brown", "Davis", "Davis", "None", "wilson", "Thompson", "Lee", "Johnson", "Johnson"],
    "age": [23, 32, 45, 22, 22, 50, 34, 47, 28, 19, 19],
    "Phone": ["555/123/4567", "333-234-5678", "444_456_7890", "111-222-3333", "111-222-3333", None, "333/987/4567", "222/345_987", "(777) 987-6543", "777-888-9999", "777-888-9999"],
    "Email": ["smith@email.com", "johnson@hotmail.com", "brown@email.com", "davis@mail.com", "davis@mail.com", "John@gmail.com", "wilson@mail.com", "thompson@email.com", "lee@email.com", "emily@hotmail.com", "emily@hotmail.com"],
    "Not_Useful_column": [None, None, None, None, None, None, None, None, None, None, None]
})

print(df_users)

    user_id       Name Last_name  age           Phone                Email  \
0       234        Tom     Smith   23    555/123/4567      smith@email.com   
1       235     Alex--   johnson   32    333-234-5678  johnson@hotmail.com   
2       236   ..Thomas     brown   45    444_456_7890      brown@email.com   
3       237       John     Davis   22    111-222-3333       davis@mail.com   
4       237       John     Davis   22    111-222-3333       davis@mail.com   
5       238      Paul/      None   50            None       John@gmail.com   
6       239      Emma9    wilson   34    333/987/4567      wilson@mail.com   
7       240        Joy  Thompson   47     222/345_987   thompson@email.com   
8       241  Samantha_       Lee   28  (777) 987-6543        lee@email.com   
9       242      Emily   Johnson   19    777-888-9999    emily@hotmail.com   
10      242      Emily   Johnson   19    777-888-9999    emily@hotmail.com   

   Not_Useful_column  
0               None  
1               None  
2               None  
3               None  
4               None  
5               None  
6               None  
7               None  
8               None  
9               None  
10              None  

In [3]:
# Remove duplicate data
df_users = df_users.drop_duplicates(subset='user_id')
df_users

Unnamed: 0,user_id,Name,Last_name,age,Phone,Email,Not_Useful_column
0,234,Tom,Smith,23,555/123/4567,smith@email.com,
1,235,Alex--,johnson,32,333-234-5678,johnson@hotmail.com,
2,236,..Thomas,brown,45,444_456_7890,brown@email.com,
3,237,John,Davis,22,111-222-3333,davis@mail.com,
5,238,Paul/,,50,,John@gmail.com,
6,239,Emma9,wilson,34,333/987/4567,wilson@mail.com,
7,240,Joy,Thompson,47,222/345_987,thompson@email.com,
8,241,Samantha_,Lee,28,(777) 987-6543,lee@email.com,
9,242,Emily,Johnson,19,777-888-9999,emily@hotmail.com,
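It can be useful to look at the repeated rows before dropping them. The sketch below uses a small hypothetical frame. `duplicated(keep=False)` flags every copy in a duplicated group, while `drop_duplicates` keeps the first occurrence by default.

```python
import pandas as pd

# A small hypothetical frame with one repeated user_id.
df = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "Name": ["Ann", "Bob", "Bob", "Cy"],
})

# keep=False marks every member of a duplicated group, so both copies are shown.
repeated = df[df.duplicated(subset="user_id", keep=False)]
print(repeated)

# keep="first" (the default) retains the first occurrence of each user_id.
deduplicated = df.drop_duplicates(subset="user_id", keep="first")
print(deduplicated)
```

Note that the surviving rows keep their original index labels, which is why the index has gaps until it is reset.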


In [4]:
#Remove special characters from Name column and capitalize last name
df_users['Name'] = df_users['Name'].str.replace(r'[^A-Za-z\s]', '', regex=True)
df_users['Last_name'] = df_users['Last_name'].str.capitalize()
df_users

Unnamed: 0,user_id,Name,Last_name,age,Phone,Email,Not_Useful_column
0,234,Tom,Smith,23,555/123/4567,smith@email.com,
1,235,Alex,Johnson,32,333-234-5678,johnson@hotmail.com,
2,236,Thomas,Brown,45,444_456_7890,brown@email.com,
3,237,John,Davis,22,111-222-3333,davis@mail.com,
5,238,Paul,,50,,John@gmail.com,
6,239,Emma,Wilson,34,333/987/4567,wilson@mail.com,
7,240,Joy,Thompson,47,222/345_987,thompson@email.com,
8,241,Samantha,Lee,28,(777) 987-6543,lee@email.com,
9,242,Emily,Johnson,19,777-888-9999,emily@hotmail.com,


In [5]:
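# Remove special characters from Phone, leaving only the digits
df_users['Phone'] = df_users['Phone'].str.replace(r'[^\d]', '', regex=True)
# Reformat 10-digit numbers as xxx-xxx-xxxx; the ^ and $ anchors leave
# numbers with any other digit count unchanged so they can be filtered out later
df_users['Phone'] = df_users['Phone'].str.replace(r'^(\d{3})(\d{3})(\d{4})$', r'\1-\2-\3', regex=True)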
df_users

Unnamed: 0,user_id,Name,Last_name,age,Phone,Email,Not_Useful_column
0,234,Tom,Smith,23,555-123-4567,smith@email.com,
1,235,Alex,Johnson,32,333-234-5678,johnson@hotmail.com,
2,236,Thomas,Brown,45,444-456-7890,brown@email.com,
3,237,John,Davis,22,111-222-3333,davis@mail.com,
5,238,Paul,,50,,John@gmail.com,
6,239,Emma,Wilson,34,333-987-4567,wilson@mail.com,
7,240,Joy,Thompson,47,222345987,thompson@email.com,
8,241,Samantha,Lee,28,777-987-6543,lee@email.com,
9,242,Emily,Johnson,19,777-888-9999,emily@hotmail.com,


In [6]:
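# Keep only rows whose phone number is exactly in xxx-xxx-xxxx format
# (fullmatch requires the whole string to match; missing values count as invalid)
correct_phone_format = df_users['Phone'].str.fullmatch(r'\d{3}-\d{3}-\d{4}', na=False)
df_users = df_users[correct_phone_format]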
df_users

Unnamed: 0,user_id,Name,Last_name,age,Phone,Email,Not_Useful_column
0,234,Tom,Smith,23,555-123-4567,smith@email.com,
1,235,Alex,Johnson,32,333-234-5678,johnson@hotmail.com,
2,236,Thomas,Brown,45,444-456-7890,brown@email.com,
3,237,John,Davis,22,111-222-3333,davis@mail.com,
6,239,Emma,Wilson,34,333-987-4567,wilson@mail.com,
8,241,Samantha,Lee,28,777-987-6543,lee@email.com,
9,242,Emily,Johnson,19,777-888-9999,emily@hotmail.com,
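When validating with a regular expression, it matters whether the pattern may match anywhere in the string or must match the whole string. The sketch below compares `str.contains` and `str.fullmatch` on hypothetical values.

```python
import pandas as pd

# Hypothetical values: one valid number, one with an extra digit, one missing.
phones = pd.Series(["555-123-4567", "555-123-45678", None])

pattern = r"\d{3}-\d{3}-\d{4}"

# str.contains succeeds if the pattern appears anywhere in the string.
loose = phones.str.contains(pattern, na=False)

# str.fullmatch succeeds only if the entire string matches the pattern.
strict = phones.str.fullmatch(pattern, na=False)

print(loose.tolist())   # [True, True, False]
print(strict.tolist())  # [True, False, False]
```

In both calls, `na=False` treats a missing phone number as a failed match, so those rows are filtered out as well.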


In [7]:
# Remove unused columns
df_users = df_users.drop(columns=['Not_Useful_column'])
df_users

Unnamed: 0,user_id,Name,Last_name,age,Phone,Email
0,234,Tom,Smith,23,555-123-4567,smith@email.com
1,235,Alex,Johnson,32,333-234-5678,johnson@hotmail.com
2,236,Thomas,Brown,45,444-456-7890,brown@email.com
3,237,John,Davis,22,111-222-3333,davis@mail.com
6,239,Emma,Wilson,34,333-987-4567,wilson@mail.com
8,241,Samantha,Lee,28,777-987-6543,lee@email.com
9,242,Emily,Johnson,19,777-888-9999,emily@hotmail.com


In [8]:
# Reset index after data cleaning
df_users = df_users.reset_index(drop=True)
df_users

Unnamed: 0,user_id,Name,Last_name,age,Phone,Email
0,234,Tom,Smith,23,555-123-4567,smith@email.com
1,235,Alex,Johnson,32,333-234-5678,johnson@hotmail.com
2,236,Thomas,Brown,45,444-456-7890,brown@email.com
3,237,John,Davis,22,111-222-3333,davis@mail.com
4,239,Emma,Wilson,34,333-987-4567,wilson@mail.com
5,241,Samantha,Lee,28,777-987-6543,lee@email.com
6,242,Emily,Johnson,19,777-888-9999,emily@hotmail.com


The code above uses the pandas `DataFrame()` constructor to create a mock dataset with 7 columns and 11 rows. The columns are `user_id` (the user's unique id), `Name`, `Last_name`, `age`, `Phone`, `Email`, and `Not_Useful_column`, which serves as an example of how to delete an unnecessary column from a dataset.

As the example dataset shows, the data has several inconsistencies: some rows are duplicated, the `Name` column contains unnecessary symbols, some values in the `Last_name` column are not capitalized, and the values in the `Phone` column use different formats, which makes them difficult to work with.
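The inconsistencies described here can also be counted rather than spotted by eye. This is a minimal sketch on three rows copied from the example dataset. The variable names are only illustrative.

```python
import pandas as pd

# A few rows copied from the example dataset above.
sample = pd.DataFrame({
    "Name": ["Tom", "Alex--", "..Thomas"],
    "Last_name": ["Smith", "johnson", "brown"],
    "Phone": ["555/123/4567", "333-234-5678", "444_456_7890"],
})

# Names containing anything other than letters.
bad_names = sample["Name"].str.contains(r"[^A-Za-z]")

# Last names whose first letter is not uppercase.
uncapitalized = ~sample["Last_name"].str[0].str.isupper()

# Phone numbers not already in xxx-xxx-xxxx form.
bad_phones = ~sample["Phone"].str.fullmatch(r"\d{3}-\d{3}-\d{4}")

print(bad_names.sum(), uncapitalized.sum(), bad_phones.sum())  # 2 2 2
```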


Your final output should look like this:

In [9]:
df_users

Unnamed: 0,user_id,Name,Last_name,age,Phone,Email
0,234,Tom,Smith,23,555-123-4567,smith@email.com
1,235,Alex,Johnson,32,333-234-5678,johnson@hotmail.com
2,236,Thomas,Brown,45,444-456-7890,brown@email.com
3,237,John,Davis,22,111-222-3333,davis@mail.com
4,239,Emma,Wilson,34,333-987-4567,wilson@mail.com
5,241,Samantha,Lee,28,777-987-6543,lee@email.com
6,242,Emily,Johnson,19,777-888-9999,emily@hotmail.com
