## How to clean data?
**Submitted by Yamini Manral 002708331**

### Why is cleaning data important?
Data cleaning is crucial in the realm of data analysis and decision-making because it directly impacts the reliability and accuracy of the insights derived from data. It is a critical step in ensuring that data is trustworthy, consistent, and free from errors or impurities. Data cleaning is interesting because it involves detective work, problem-solving, and creative thinking. It requires identifying and addressing a wide range of data quality issues, from missing values and outliers to inconsistent formatting and duplicates. By successfully cleaning and preparing data, analysts and data scientists can unlock the true potential of their datasets, leading to more reliable conclusions and informed decision-making. Ultimately, data cleaning is the foundation upon which data-driven insights and innovations are built, making it a fascinating and indispensable aspect of the data science journey.

### The Fascination of Data Cleaning

Data cleaning, though often perceived as a preliminary and mundane task in the data science journey, holds a unique and captivating allure. Here are some reasons why data cleaning is not just important but also interesting:

1. **Data's Imperfections:** Real-world data is rarely pristine. It's a reflection of the messy, unpredictable nature of human activities. Exploring and uncovering the quirks and imperfections in data is like solving a puzzle.

2. **Data Detective Work:** Data cleaning requires a detective's mindset. You'll investigate missing values, outliers, inconsistencies, and unexpected patterns. It's akin to solving mysteries within the data.

3. **Creativity in Problem-Solving:** Finding innovative ways to address data quality issues is both challenging and creatively fulfilling. You may need to craft custom solutions for unique problems.

4. **Impact on Insights:** The quality of your data profoundly impacts the insights you can derive. Data cleaning directly contributes to the reliability and trustworthiness of your findings.

5. **Multidisciplinary Nature:** Data cleaning draws from diverse fields, including statistics, programming, linguistics (for text data), and domain expertise. It's an interdisciplinary playground.

6. **Continuous Learning:** As technology evolves, so do data cleaning techniques. Staying updated with the latest methods and tools keeps the process engaging.

7. **Data Ethics and Bias:** Exploring data cleaning means confronting ethical questions about data representation and bias mitigation. It's an avenue for discussions about responsible data handling.

8. **Data Transformation:** Data cleaning often leads to data transformation and feature engineering. You'll convert raw data into meaningful variables for analysis, a creative process in itself.

9. **Automation Potential:** While data cleaning can be manual, automation and AI-driven tools are emerging. Building and fine-tuning these tools adds a layer of excitement.

10. **Foundation for Insights:** Data cleaning is the bedrock upon which data-driven insights and machine learning models are built. It's the starting point of any data science adventure.

In summary, data cleaning isn't merely a mundane chore—it's an exploration of data's intricacies, a creative problem-solving exercise, and a critical step in the journey to extract valuable insights. It's where the magic begins in the fascinating world of data science.


### Data Cleaning Techniques

Data cleaning is an essential step in preparing data for analysis or machine learning. Here are some common data cleaning techniques:

1. **Handling Missing Values:**
   - Imputation: Fill missing values using mean, median, mode, or advanced methods.
   - Deletion: Remove rows or columns with a significant number of missing values.

2. **Dealing with Duplicates:**
   - Detect and remove duplicate rows to avoid redundancy.
   - Handle duplicate data based on business rules if needed.

3. **Outlier Detection and Treatment:**
   - Identify and handle outliers using methods like Z-score, IQR, or winsorization.
   - Decide whether to cap, transform, or remove outliers based on domain knowledge.

4. **Standardizing Data:**
   - Standardize units and formats in columns (e.g., converting all dates to a consistent format).
   - Normalize data to have zero mean and unit variance for machine learning.

5. **Converting Data Types:**
   - Ensure data types are appropriate for analysis (e.g., converting categorical variables to numerical using one-hot encoding).
   - Parse dates and times into datetime objects for time series analysis.

6. **Handling Inconsistent Data:**
   - Resolve inconsistencies in data entry (e.g., capitalization, spelling errors) using string manipulation or fuzzy matching.
   - Merge categories with similar meanings in categorical data.

7. **Encoding Categorical Data:**
   - Convert categorical variables into numerical representations using techniques like label encoding or one-hot encoding.

8. **Text Data Cleaning:**
   - Remove special characters, punctuation, and extra whitespace from text data.
   - Tokenize, lemmatize, or stem text for natural language processing (NLP).

9. **Handling Data Integrity Issues:**
   - Check for data integrity problems, such as referential integrity violations.
   - Correct data inconsistencies in related tables or datasets.

10. **Handling Data Scale and Skewness:**
    - Apply scaling techniques like Min-Max scaling or Z-score scaling to address varying data scales.
    - Log-transform data with skewed distributions to make them more normally distributed.

11. **Data Validation and Cross-Checking:**
    - Cross-check data with external sources or domain-specific rules to validate its accuracy.
    - Perform sanity checks and verify data against known benchmarks.

12. **Data Imputation for Time Series:**
    - Impute missing values in time series data using methods like linear interpolation or forward/backward filling.

13. **Data Sampling and Resampling:**
    - Create balanced datasets by oversampling or undersampling when dealing with imbalanced classes.
    - Resample time series data for different time granularities.

14. **Data Transformation:**
    - Perform feature engineering to create new meaningful features from existing ones.
    - Aggregate or pivot data to create summary statistics or pivot tables.

15. **Data Cleaning Automation:**
    - Develop automated scripts or pipelines for routine data cleaning tasks to ensure consistency.

Remember that the choice of data cleaning techniques depends on the specific characteristics of your dataset and the goals of your analysis or machine learning project. It often involves a combination of these techniques to prepare data for meaningful insights and modeling.
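
To make a few of the techniques above concrete, here is a minimal sketch of median imputation, IQR-based outlier capping, and Min-Max scaling. The `toy` DataFrame and its column names are hypothetical and are not part of the customer dataset used later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing age and one implausibly large income.
toy = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 41.0, 38.0],
    "income": [40_000.0, 52_000.0, 48_000.0, 1_000_000.0, 45_000.0],
})

# 1. Imputation: fill the missing age with the column median.
toy["age"] = toy["age"].fillna(toy["age"].median())

# 2. Outlier treatment: cap income at the IQR fences
#    (Q1 - 1.5 * IQR and Q3 + 1.5 * IQR).
q1, q3 = toy["income"].quantile([0.25, 0.75])
iqr = q3 - q1
toy["income_capped"] = toy["income"].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# 3. Scaling: Min-Max scale the capped income into the range [0, 1].
col = toy["income_capped"]
toy["income_scaled"] = (col - col.min()) / (col.max() - col.min())

print(toy)
```

Whether capping is preferable to removing or transforming an outlier depends on domain knowledge, so treat the 1.5 × IQR fences as a common convention rather than a rule.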

pandas and NumPy are the usual tools for data cleaning in Python; this walkthrough needs only pandas. We begin by importing the pandas library and using the `read_excel()` function to read data from our Excel sheet.

We have an Excel sheet with customer details on it. We need to clean this data to make it consistent and usable.

In [1]:
import pandas as pd
df = pd.read_excel(r'Customer Call List.xlsx')
df

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Address,Paying Customer,Do_Not_Contact,Not_Useful_Column
0,1001,Frodo,Baggins,123-545-5421,"123 Shire Lane, Shire",Yes,No,True
1,1002,Abed,Nadir,123/643/9775,93 West Main Street,No,Yes,False
2,1003,Walter,/White,7066950392,298 Drugs Driveway,N,,True
3,1004,Dwight,Schrute,123-543-2345,"980 Paper Avenue, Pennsylvania, 18503",Yes,Y,True
4,1005,Jon,Snow,876|678|3469,123 Dragons Road,Y,No,True
5,1006,Ron,Swanson,304-762-2467,768 City Parkway,Yes,Yes,True
6,1007,Jeff,Winger,,1209 South Street,No,No,False
7,1008,Sherlock,Holmes,876|678|3469,98 Clue Drive,N,No,False
8,1009,Gandalf,,N/a,123 Middle Earth,Yes,,False
9,1010,Peter,Parker,123-545-5421,"25th Main Street, New York",Yes,No,True


Right off the bat, we see problems with this data set: it's messy and inconsistent. Let's make a list of the things we need to fix here:
1. Removing duplicate values
2. Removing special characters (Text Data Cleaning)
3. Formatting the column Phone_Number for consistency
4. Handling Inconsistent Data - Capitalization
5. Splitting columns
6. Dealing with missing values (data imputation)
7. Removing nulls
8. Dropping irrelevant data

Overall, our goal is to clean this dataset so it is consistent and easily comprehensible.
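
Before fixing anything, it can help to quantify the problems rather than rely on eyeballing the table. The sketch below runs on a small hypothetical `sample` that only mimics a few rows of the sheet (with one row repeated on purpose); the same calls could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample that mimics a few rows of the customer sheet,
# with the last row repeated so there is a duplicate to find.
sample = pd.DataFrame({
    "CustomerID": [1001, 1003, 1007, 1009, 1009],
    "Last_Name": ["Baggins", "/White", "Winger", np.nan, np.nan],
    "Phone_Number": ["123-545-5421", 7066950392, np.nan, "N/a", "N/a"],
})

# Missing values per column.
missing_per_column = sample.isna().sum()

# Number of rows that exactly repeat an earlier row.
duplicate_rows = int(sample.duplicated().sum())

# Python type names stored in an object column reveal mixed str/int/float data.
phone_type_counts = sample["Phone_Number"].map(lambda v: type(v).__name__).value_counts()

print(missing_per_column, duplicate_rows, phone_type_counts, sep="\n\n")
```

Note that `isna()` does not count placeholder strings such as `N/a`; those have to be handled as text.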

One by one, let us try to achieve this goal. We begin with dropping duplicates.

## 1. Removing duplicate values

In [2]:
# removing duplicates
df = df.drop_duplicates()
df

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Address,Paying Customer,Do_Not_Contact,Not_Useful_Column
0,1001,Frodo,Baggins,123-545-5421,"123 Shire Lane, Shire",Yes,No,True
1,1002,Abed,Nadir,123/643/9775,93 West Main Street,No,Yes,False
2,1003,Walter,/White,7066950392,298 Drugs Driveway,N,,True
3,1004,Dwight,Schrute,123-543-2345,"980 Paper Avenue, Pennsylvania, 18503",Yes,Y,True
4,1005,Jon,Snow,876|678|3469,123 Dragons Road,Y,No,True
5,1006,Ron,Swanson,304-762-2467,768 City Parkway,Yes,Yes,True
6,1007,Jeff,Winger,,1209 South Street,No,No,False
7,1008,Sherlock,Holmes,876|678|3469,98 Clue Drive,N,No,False
8,1009,Gandalf,,N/a,123 Middle Earth,Yes,,False
9,1010,Peter,Parker,123-545-5421,"25th Main Street, New York",Yes,No,True


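`drop_duplicates()` drops rows whose values match an earlier row in every column, keeping the first occurrence by default. For our dataset, it removes the repeated copy of the row at index 19; the first copy stays, which is why index 19 still appears in later outputs.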
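
As a small illustration of how `drop_duplicates()` decides what to remove, the sketch below uses a hypothetical three-row frame (`calls`) in which the last row repeats the second one. The `keep` and `subset` parameters are part of the pandas API; the data is made up.

```python
import pandas as pd

# Hypothetical rows: the last row repeats the second one exactly.
calls = pd.DataFrame({
    "CustomerID": [1, 2, 2],
    "First_Name": ["Ann", "Bob", "Bob"],
    "Phone_Number": ["N/a", "876|678|3469", "876|678|3469"],
})

# Boolean mask marking rows that repeat an earlier row in every column.
print(calls.duplicated())

# keep="first" (the default) retains index 1 and drops index 2.
deduped = calls.drop_duplicates(keep="first")

# subset= restricts the comparison to chosen columns, e.g. one row per ID.
one_per_id = calls.drop_duplicates(subset=["CustomerID"])

print(deduped)
```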

## 2. Dropping irrelevant columns

In [3]:
# dropping irrelevant data
df = df.drop(columns= "Not_Useful_Column")
df

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Address,Paying Customer,Do_Not_Contact
0,1001,Frodo,Baggins,123-545-5421,"123 Shire Lane, Shire",Yes,No
1,1002,Abed,Nadir,123/643/9775,93 West Main Street,No,Yes
2,1003,Walter,/White,7066950392,298 Drugs Driveway,N,
3,1004,Dwight,Schrute,123-543-2345,"980 Paper Avenue, Pennsylvania, 18503",Yes,Y
4,1005,Jon,Snow,876|678|3469,123 Dragons Road,Y,No
5,1006,Ron,Swanson,304-762-2467,768 City Parkway,Yes,Yes
6,1007,Jeff,Winger,,1209 South Street,No,No
7,1008,Sherlock,Holmes,876|678|3469,98 Clue Drive,N,No
8,1009,Gandalf,,N/a,123 Middle Earth,Yes,
9,1010,Peter,Parker,123-545-5421,"25th Main Street, New York",Yes,No


Now that this is done, let's fix the columns one by one. Starting with **Last_Name**, we want to remove the stray special characters from this column. To do so, we can use the `str.strip()` function.

> Note: `strip()` only removes characters from the two ends of a string, left and right. If there were unwanted characters in the middle of a last name, we would need another technique such as `replace()`.

In [4]:
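# strip the digits 1-3 and the characters . / ' \ _ from both ends (raw string avoids an invalid escape)
df["Last_Name"] = df["Last_Name"].str.strip(r"123./'\_")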
df

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Address,Paying Customer,Do_Not_Contact
0,1001,Frodo,Baggins,123-545-5421,"123 Shire Lane, Shire",Yes,No
1,1002,Abed,Nadir,123/643/9775,93 West Main Street,No,Yes
2,1003,Walter,White,7066950392,298 Drugs Driveway,N,
3,1004,Dwight,Schrute,123-543-2345,"980 Paper Avenue, Pennsylvania, 18503",Yes,Y
4,1005,Jon,Snow,876|678|3469,123 Dragons Road,Y,No
5,1006,Ron,Swanson,304-762-2467,768 City Parkway,Yes,Yes
6,1007,Jeff,Winger,,1209 South Street,No,No
7,1008,Sherlock,Holmes,876|678|3469,98 Clue Drive,N,No
8,1009,Gandalf,,N/a,123 Middle Earth,Yes,
9,1010,Peter,Parker,123-545-5421,"25th Main Street, New York",Yes,No
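
The difference between `strip()` and `replace()` mentioned in the note can be seen on a few made-up values. This is only a sketch: the `names` Series is hypothetical, and the character set `./_` is chosen for the example.

```python
import pandas as pd

# Hypothetical last names with stray characters at the ends and in the middle.
names = pd.Series(["/White", "...Potter", "Weasley_", "Sky/walker"])

# str.strip(chars) treats its argument as a set of characters and removes
# them from both ends only, so "Sky/walker" is left unchanged.
stripped = names.str.strip("./_")

# str.replace with a regular expression removes the characters anywhere.
replaced = names.str.replace(r"[./_]", "", regex=True)

print(stripped.tolist())
print(replaced.tolist())
```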


## 3. Reformatting 

Much cleaner! Let's move on to the **Phone_Number** column and identify the issues. Firstly, we need to remove any special characters. Secondly, we need the numbers in a particular format, i.e. _123-456-7890_.

Removing all non-digit characters from the column **Phone_Number** using `str.replace()` with the regular expression `\D`:

In [5]:
# removing special characters from phone number
df["Phone_Number"] = df["Phone_Number"].str.replace(r'\D', '', regex=True)
df

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Address,Paying Customer,Do_Not_Contact
0,1001,Frodo,Baggins,1235455421.0,"123 Shire Lane, Shire",Yes,No
1,1002,Abed,Nadir,1236439775.0,93 West Main Street,No,Yes
2,1003,Walter,White,,298 Drugs Driveway,N,
3,1004,Dwight,Schrute,1235432345.0,"980 Paper Avenue, Pennsylvania, 18503",Yes,Y
4,1005,Jon,Snow,8766783469.0,123 Dragons Road,Y,No
5,1006,Ron,Swanson,3047622467.0,768 City Parkway,Yes,Yes
6,1007,Jeff,Winger,,1209 South Street,No,No
7,1008,Sherlock,Holmes,8766783469.0,98 Clue Drive,N,No
8,1009,Gandalf,,,123 Middle Earth,Yes,
9,1010,Peter,Parker,1235455421.0,"25th Main Street, New York",Yes,No


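We are still left with NaNs and empty cells, which we will deal with later. Row 2 also became NaN even though it held a valid number: that cell was stored as an integer rather than a string, and the `.str` accessor returns NaN for non-string values.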
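
The sketch below reproduces this behaviour on a hypothetical `phones` Series and shows one possible way to avoid losing numeric cells: cast the column to strings before stripping non-digits. This is an alternative ordering, not the one used in this notebook.

```python
import pandas as pd

# Hypothetical column mixing strings with an integer, as can happen when a
# spreadsheet cell holds a plain number.
phones = pd.Series(["123-545-5421", 7066950392, "N/a"])

# The .str accessor returns NaN for the integer entry.
digits_lossy = phones.str.replace(r"\D", "", regex=True)

# Casting to str first keeps every entry as text, so the digits survive.
digits_kept = phones.astype(str).str.replace(r"\D", "", regex=True)

print(digits_lossy.tolist())
print(digits_kept.tolist())
```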

Note that **Phone_Number** now holds a mix of strings and NaN (which is a float), so we cannot slice every value as a string directly. We therefore convert each cell value to a string using a `lambda` function:

In [6]:
# converting the phone number column to string
df["Phone_Number"] = df["Phone_Number"].apply(lambda x: str(x))
df

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Address,Paying Customer,Do_Not_Contact
0,1001,Frodo,Baggins,1235455421.0,"123 Shire Lane, Shire",Yes,No
1,1002,Abed,Nadir,1236439775.0,93 West Main Street,No,Yes
2,1003,Walter,White,,298 Drugs Driveway,N,
3,1004,Dwight,Schrute,1235432345.0,"980 Paper Avenue, Pennsylvania, 18503",Yes,Y
4,1005,Jon,Snow,8766783469.0,123 Dragons Road,Y,No
5,1006,Ron,Swanson,3047622467.0,768 City Parkway,Yes,Yes
6,1007,Jeff,Winger,,1209 South Street,No,No
7,1008,Sherlock,Holmes,8766783469.0,98 Clue Drive,N,No
8,1009,Gandalf,,,123 Middle Earth,Yes,
9,1010,Peter,Parker,1235455421.0,"25th Main Street, New York",Yes,No


Now we want to change each value to the format _123-456-7890_. We can do this with a lambda function that inserts a dash (-) at the desired positions.

In [7]:
# manipulating string to add - between numbers
df["Phone_Number"] = df["Phone_Number"].apply(lambda x: x[0:3] + '-' + x[3:6] + '-' + x[6:10])
df

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Address,Paying Customer,Do_Not_Contact
0,1001,Frodo,Baggins,123-545-5421,"123 Shire Lane, Shire",Yes,No
1,1002,Abed,Nadir,123-643-9775,93 West Main Street,No,Yes
2,1003,Walter,White,nan--,298 Drugs Driveway,N,
3,1004,Dwight,Schrute,123-543-2345,"980 Paper Avenue, Pennsylvania, 18503",Yes,Y
4,1005,Jon,Snow,876-678-3469,123 Dragons Road,Y,No
5,1006,Ron,Swanson,304-762-2467,768 City Parkway,Yes,Yes
6,1007,Jeff,Winger,nan--,1209 South Street,No,No
7,1008,Sherlock,Holmes,876-678-3469,98 Clue Drive,N,No
8,1009,Gandalf,,--,123 Middle Earth,Yes,
9,1010,Peter,Parker,123-545-5421,"25th Main Street, New York",Yes,No


This gets the job done, partially: the missing values have turned into leftovers such as `nan--` and `--`. Let's remove `nan--`, `na--` and `--` using `replace()` to finish cleaning this column:

In [8]:
df["Phone_Number"] = df["Phone_Number"].str.replace('nan--', '')
df["Phone_Number"] = df["Phone_Number"].str.replace('na--', '')
df["Phone_Number"] = df["Phone_Number"].str.replace('--', '')
df

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Address,Paying Customer,Do_Not_Contact
0,1001,Frodo,Baggins,123-545-5421,"123 Shire Lane, Shire",Yes,No
1,1002,Abed,Nadir,123-643-9775,93 West Main Street,No,Yes
2,1003,Walter,White,,298 Drugs Driveway,N,
3,1004,Dwight,Schrute,123-543-2345,"980 Paper Avenue, Pennsylvania, 18503",Yes,Y
4,1005,Jon,Snow,876-678-3469,123 Dragons Road,Y,No
5,1006,Ron,Swanson,304-762-2467,768 City Parkway,Yes,Yes
6,1007,Jeff,Winger,,1209 South Street,No,No
7,1008,Sherlock,Holmes,876-678-3469,98 Clue Drive,N,No
8,1009,Gandalf,,,123 Middle Earth,Yes,
9,1010,Peter,Parker,123-545-5421,"25th Main Street, New York",Yes,No
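
The phone number steps above can also be folded into one helper, which avoids creating and then removing the `nan--` leftovers. The function name `format_phone` and the rule "anything that is not exactly 10 digits becomes an empty string" are assumptions made for this sketch. Unlike the steps above, it also keeps a number that was stored as an integer.

```python
import re

import numpy as np
import pandas as pd


def format_phone(value):
    """Return 'XXX-XXX-XXXX' for a 10-digit number, else an empty string."""
    if pd.isna(value):
        return ""
    # Keep digits only, whatever separators were used.
    digits = re.sub(r"\D", "", str(value))
    if len(digits) != 10:
        return ""
    return f"{digits[0:3]}-{digits[3:6]}-{digits[6:10]}"


# Hypothetical raw values covering the formats seen in the sheet.
raw = pd.Series(["123-545-5421", "123/643/9775", 7066950392, "876|678|3469", np.nan, "N/a"])
formatted = raw.apply(format_phone)
print(formatted.tolist())
```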


## 4. Splitting columns

Much better. Now we can move on to the column **Address**. We need to split it into 3 columns: **Street**, **City** and **Zipcode**. We can do this using the `str.split()` function.

In [9]:
df[['Street', 'City', 'Zipcode']] = df['Address'].str.split(', ', expand=True)
df

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Address,Paying Customer,Do_Not_Contact,Street,City,Zipcode
0,1001,Frodo,Baggins,123-545-5421,"123 Shire Lane, Shire",Yes,No,123 Shire Lane,Shire,
1,1002,Abed,Nadir,123-643-9775,93 West Main Street,No,Yes,93 West Main Street,,
2,1003,Walter,White,,298 Drugs Driveway,N,,298 Drugs Driveway,,
3,1004,Dwight,Schrute,123-543-2345,"980 Paper Avenue, Pennsylvania, 18503",Yes,Y,980 Paper Avenue,Pennsylvania,18503.0
4,1005,Jon,Snow,876-678-3469,123 Dragons Road,Y,No,123 Dragons Road,,
5,1006,Ron,Swanson,304-762-2467,768 City Parkway,Yes,Yes,768 City Parkway,,
6,1007,Jeff,Winger,,1209 South Street,No,No,1209 South Street,,
7,1008,Sherlock,Holmes,876-678-3469,98 Clue Drive,N,No,98 Clue Drive,,
8,1009,Gandalf,,,123 Middle Earth,Yes,,123 Middle Earth,,
9,1010,Peter,Parker,123-545-5421,"25th Main Street, New York",Yes,No,25th Main Street,New York,
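
Assigning the split result to exactly three column names only works if the split produces exactly three columns. As a hedged variation, the sketch below passes `n=2` to `str.split()` so that at most two splits are made; the `addresses` Series is a hypothetical sample.

```python
import pandas as pd

# Hypothetical addresses with one, two, and three comma-separated parts.
addresses = pd.Series([
    "93 West Main Street",
    "123 Shire Lane, Shire",
    "980 Paper Avenue, Pennsylvania, 18503",
])

# n=2 allows at most two splits, so the result never has more than three columns.
parts = addresses.str.split(", ", n=2, expand=True)
parts.columns = ["Street", "City", "Zipcode"]

# Parts that are missing come back as None/NaN; fill them with empty strings.
parts = parts.fillna("")
print(parts)
```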


We can drop the column **Address**, since its contents are now split into 3 separate columns and keeping it would be redundant.

In [10]:
df = df.drop(columns = "Address")

## 5. Handling Inconsistent Data - Capitalization

Now we move on to the columns **Paying Customer** and **Do_Not_Contact**. For the sake of consistency, we will write every Yes as Y and every No as N, remove the N/a entries, and replace NaNs with empty strings:

In [11]:
df.replace({'Yes': 'Y', 'No': 'N', 'N/a': ''}, inplace=True)
df.fillna('', inplace=True)
df

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Paying Customer,Do_Not_Contact,Street,City,Zipcode
0,1001,Frodo,Baggins,123-545-5421,Y,N,123 Shire Lane,Shire,
1,1002,Abed,Nadir,123-643-9775,N,Y,93 West Main Street,,
2,1003,Walter,White,,N,,298 Drugs Driveway,,
3,1004,Dwight,Schrute,123-543-2345,Y,Y,980 Paper Avenue,Pennsylvania,18503.0
4,1005,Jon,Snow,876-678-3469,Y,N,123 Dragons Road,,
5,1006,Ron,Swanson,304-762-2467,Y,Y,768 City Parkway,,
6,1007,Jeff,Winger,,N,N,1209 South Street,,
7,1008,Sherlock,Holmes,876-678-3469,N,N,98 Clue Drive,,
8,1009,Gandalf,,,Y,,123 Middle Earth,,
9,1010,Peter,Parker,123-545-5421,Y,N,25th Main Street,New York,
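
Calling `replace()` on the whole DataFrame rewrites matching values in every column. If that is too broad, the mapping can be limited to the flag columns, as in this sketch. The `flags` frame is hypothetical, including a customer whose last name happens to be "No".

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the last name "No" should not be rewritten to "N".
flags = pd.DataFrame({
    "Last_Name": ["Baggins", "No", "Snow", "Holmes"],
    "Paying Customer": ["Yes", "N", "Y", "No"],
    "Do_Not_Contact": ["No", np.nan, "Yes", "Y"],
})

yes_no_columns = ["Paying Customer", "Do_Not_Contact"]
mapping = {"Yes": "Y", "No": "N"}

# Apply the mapping to the two flag columns only, then blank out missing flags.
flags[yes_no_columns] = flags[yes_no_columns].replace(mapping).fillna("")
print(flags)
```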


## 6. Renaming columns

We can also rename columns using `rename()` so that all column names follow the same convention:

In [12]:
df.rename(columns={'Paying Customer': 'Paying_Customer', 'CustomerID' : 'Customer_ID'}, inplace=True)
df

Unnamed: 0,Customer_ID,First_Name,Last_Name,Phone_Number,Paying_Customer,Do_Not_Contact,Street,City,Zipcode
0,1001,Frodo,Baggins,123-545-5421,Y,N,123 Shire Lane,Shire,
1,1002,Abed,Nadir,123-643-9775,N,Y,93 West Main Street,,
2,1003,Walter,White,,N,,298 Drugs Driveway,,
3,1004,Dwight,Schrute,123-543-2345,Y,Y,980 Paper Avenue,Pennsylvania,18503.0
4,1005,Jon,Snow,876-678-3469,Y,N,123 Dragons Road,,
5,1006,Ron,Swanson,304-762-2467,Y,Y,768 City Parkway,,
6,1007,Jeff,Winger,,N,N,1209 South Street,,
7,1008,Sherlock,Holmes,876-678-3469,N,N,98 Clue Drive,,
8,1009,Gandalf,,,Y,,123 Middle Earth,,
9,1010,Peter,Parker,123-545-5421,Y,N,25th Main Street,New York,
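
When many column names contain spaces, renaming them one by one gets tedious. A possible shortcut, sketched here on a hypothetical empty `frame`, is to apply a string operation to all column names at once and keep `rename()` for the cases that need an explicit mapping.

```python
import pandas as pd

# Hypothetical frame with only column names, to show the renaming.
frame = pd.DataFrame(columns=["CustomerID", "First_Name", "Paying Customer"])

# Replace spaces with underscores in every column name at once.
frame.columns = frame.columns.str.replace(" ", "_")

# Names without a separator still need an explicit mapping.
frame = frame.rename(columns={"CustomerID": "Customer_ID"})
print(frame.columns.tolist())
```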


## 7. Removing nulls

Our data looks much cleaner now, except that there are still empty cell values in it. We can eliminate those by dropping undesired rows. Which rows to drop depends on what we want out of the dataset. Here we need a list of customers who may be contacted and who have a valid phone number.

In [13]:
# keep only the customers who have not asked to be left alone
df = df[df["Do_Not_Contact"] != 'Y']

df

Unnamed: 0,Customer_ID,First_Name,Last_Name,Phone_Number,Paying_Customer,Do_Not_Contact,Street,City,Zipcode
0,1001,Frodo,Baggins,123-545-5421,Y,N,123 Shire Lane,Shire,
2,1003,Walter,White,,N,,298 Drugs Driveway,,
4,1005,Jon,Snow,876-678-3469,Y,N,123 Dragons Road,,
6,1007,Jeff,Winger,,N,N,1209 South Street,,
7,1008,Sherlock,Holmes,876-678-3469,N,N,98 Clue Drive,,
8,1009,Gandalf,,,Y,,123 Middle Earth,,
9,1010,Peter,Parker,123-545-5421,Y,N,25th Main Street,New York,
10,1011,Samwise,Gamgee,,Y,N,612 Shire Lane,Shire,
11,1012,Harry,Potter,,Y,,2394 Hogwarts Avenue,,
12,1013,Don,Draper,123-543-2345,Y,N,2039 Main Street,,


We also don't want rows with no phone number at all, so we can drop these rows:

In [14]:
# keep only the rows that have a phone number
df = df[df["Phone_Number"] != '']

df

Unnamed: 0,Customer_ID,First_Name,Last_Name,Phone_Number,Paying_Customer,Do_Not_Contact,Street,City,Zipcode
0,1001,Frodo,Baggins,123-545-5421,Y,N,123 Shire Lane,Shire,
4,1005,Jon,Snow,876-678-3469,Y,N,123 Dragons Road,,
7,1008,Sherlock,Holmes,876-678-3469,N,N,98 Clue Drive,,
9,1010,Peter,Parker,123-545-5421,Y,N,25th Main Street,New York,
12,1013,Don,Draper,123-543-2345,Y,N,2039 Main Street,,
13,1014,Leslie,Knope,876-678-3469,Y,N,343 City Parkway,,
14,1015,Toby,Flenderson,304-762-2467,N,N,214 HR Avenue,,
15,1016,Ron,Weasley,123-545-5421,N,N,2395 Hogwarts Avenue,,
16,1017,Michael,Scott,123-643-9775,Y,N,121 Paper Avenue,Pennsylvania,
19,1020,Anakin,Skywalker,876-678-3469,Y,N,910 Tatooine Road,Tatooine,


Finally, we reset the index so that the row labels run consecutively from 0 and reflect the true number of rows.

In [15]:
df = df.reset_index(drop=True)
df

Unnamed: 0,Customer_ID,First_Name,Last_Name,Phone_Number,Paying_Customer,Do_Not_Contact,Street,City,Zipcode
0,1001,Frodo,Baggins,123-545-5421,Y,N,123 Shire Lane,Shire,
1,1005,Jon,Snow,876-678-3469,Y,N,123 Dragons Road,,
2,1008,Sherlock,Holmes,876-678-3469,N,N,98 Clue Drive,,
3,1010,Peter,Parker,123-545-5421,Y,N,25th Main Street,New York,
4,1013,Don,Draper,123-543-2345,Y,N,2039 Main Street,,
5,1014,Leslie,Knope,876-678-3469,Y,N,343 City Parkway,,
6,1015,Toby,Flenderson,304-762-2467,N,N,214 HR Avenue,,
7,1016,Ron,Weasley,123-545-5421,N,N,2395 Hogwarts Avenue,,
8,1017,Michael,Scott,123-643-9775,Y,N,121 Paper Avenue,Pennsylvania,
9,1020,Anakin,Skywalker,876-678-3469,Y,N,910 Tatooine Road,Tatooine,
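
The row filtering and the index reset can also be written as a single expression with boolean masks. The sketch below uses a hypothetical `customers` frame in the same shape as the cleaned table; the mask names are made up for readability.

```python
import pandas as pd

# Hypothetical cleaned sample in the same shape as the table above.
customers = pd.DataFrame({
    "Customer_ID": [1001, 1002, 1003, 1005],
    "Phone_Number": ["123-545-5421", "123-643-9775", "", "876-678-3469"],
    "Do_Not_Contact": ["N", "Y", "", "N"],
})

# Keep rows that may be contacted and that have a phone number.
may_contact = customers["Do_Not_Contact"] != "Y"
has_phone = customers["Phone_Number"] != ""

# Combine both conditions and renumber the remaining rows from 0.
call_list = customers[may_contact & has_phone].reset_index(drop=True)
print(call_list)
```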


## Conclusion

In this exploration of data cleaning and manipulation in Python, we've covered a range of essential techniques and practices for preparing data for analysis or machine learning. Starting from loading a dataset, we learned how to handle missing values, deal with duplicates, and clean text data.
We also looked at other aspects of data cleaning, such as standardizing data and handling inconsistent entries.

Throughout this journey, we highlighted the multidisciplinary nature of data cleaning, which draws from fields like statistics, programming, and domain expertise. We also emphasized the creative problem-solving aspect of data cleaning, as it often requires innovative solutions tailored to the specific characteristics of the data.

Lastly, we learned how to perform DataFrame manipulations, all of which are essential skills for data preparation.

In summary, data cleaning is not just a preliminary step but an engaging exploration of data's intricacies. It's where the foundation for meaningful insights and machine learning models is laid, making it a vital and fascinating aspect of the data science journey.

## References:

1. [Pythonic Data Cleaning With pandas and NumPy](https://realpython.com/python-data-cleaning-numpy-pandas/)
2. [How to Clean Your Data in Python](https://towardsdatascience.com/how-to-clean-your-data-in-python-8f178638b98d)
3. [Pandas - Cleaning Data](https://www.w3schools.com/python/pandas/pandas_cleaning.asp)
4. [Data Cleaning in Pandas | Python Pandas Tutorials](https://www.youtube.com/watch?v=bDhvCp3_lYw)