**Project 8 - Data Cleaning on Feedback Dataset**

**Data cleaning** is the process of correcting or removing corrupt, incorrect, or unnecessary data from a data set before data analysis.

Expanding on this basic definition, data cleaning (often grouped with data cleansing, data scrubbing, and data preparation) turns messy, potentially problematic data into clean data. Importantly, ‘clean data’ here means data that the powerful data analysis engines you spent money on can actually use.

In [63]:
# Import pandas and NumPy for use in this notebook
import pandas as pd
import numpy as np

In [71]:
# Load the customer feedback dataset
data = pd.read_csv('feedbackTable.csv')
data

Unnamed: 0,Rating,Review Title,Review,Customer Name,Date,Review ID
0,4,Works well,"\nThe product works fine, it is maybe a little...",Phillip,October 10 2021,#7653
1,3,good enough,,elena,October 5 2021,
2,5,Everyone should buy this,You should buy this.,Olivia,,
3,5,Amazing product,"Love everything about this product,it works g...",John,10/5/2021,
4,1,this is terrible,The product never worked for me.,Paula,44491.00,#8563
5,2,Doesnt work,This doesnt work as advertised.,Ellie,,
6,4,cool product,"This worked well for me,Not 5 stars because t...",DAVE,September 5 2021,#4162
7,5,BEST THING EVER,"Go and buy this right now, it's amazing.",Pablo,,
8,5,Amazing product,"Love everything about this product,it works g...",John,10/5/2021,#5675
9,5,Love this!!,Iwould 100% recommend this to everone.,CARA,,


**Locate Missing Data**

In [72]:
data.isnull()


Unnamed: 0,Rating,Review Title,Review,Customer Name,Date,Review ID
0,False,False,False,False,False,False
1,False,False,True,False,False,True
2,False,False,False,False,True,True
3,False,False,False,False,False,True
4,False,False,False,False,False,False
5,False,False,False,False,True,True
6,False,False,False,False,False,False
7,False,False,False,False,True,True
8,False,False,False,False,False,False
9,False,False,False,False,True,True


The output is a table of boolean values with the same shape as the dataset.

The table gives us several insights. First and foremost is where the missing data is: any ‘True’ reading under a column indicates that the value in that column is missing for that row.

So, for example, datapoint 1 has missing data in its Review column and its Review ID column (both are marked True).

We can summarize this by counting the missing values in each feature:

In [73]:
data.isnull().sum()


Rating           0
Review Title     0
Review           1
Customer Name    0
Date             5
Review ID        7
dtype: int64
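
The counts above are easier to compare when expressed as a share of all rows. The following is a minimal sketch on a made-up miniature table (not the real `feedbackTable.csv`); the 50% cut-off is an assumption chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the feedback table.
sample = pd.DataFrame({
    "Rating": [4, 3, 5, 5],
    "Review": ["Works fine", np.nan, "You should buy this.", "Love it"],
    "Review ID": ["#7653", np.nan, np.nan, np.nan],
})

# isnull() yields booleans; the mean of booleans is the fraction of True values.
missing_share = sample.isnull().mean()

# Columns where more than half of the values are missing (threshold is an assumption).
mostly_missing = missing_share[missing_share > 0.5].index.tolist()

print(missing_share)
print(mostly_missing)
```

A column that is mostly empty is a candidate for dropping, while a column with only a few gaps is usually a better candidate for filling in.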

From here, we use code to actually clean the data. This boils down to two basic options:

1) Drop the data, or

2) Fill in (impute) the missing data.

If you opt to:

**Drop the data**

You’ll have to make another decision: whether to drop only the missing values and keep the rest of the data in the set, or to eliminate the feature (the entire column) wholesale because it has so many missing datapoints that it isn’t fit for analysis.

If you want to drop only the missing values, you’ll have to go in and mark them as missing according to pandas or NumPy conventions (for example, as NaN).
But if you want to drop the entire column, here’s the code:

In [74]:
remove = ['Review ID', 'Date']
data.drop(columns=remove, inplace=True)

In [75]:
data

Unnamed: 0,Rating,Review Title,Review,Customer Name
0,4,Works well,"\nThe product works fine, it is maybe a little...",Phillip
1,3,good enough,,elena
2,5,Everyone should buy this,You should buy this.,Olivia
3,5,Amazing product,"Love everything about this product,it works g...",John
4,1,this is terrible,The product never worked for me.,Paula
5,2,Doesnt work,This doesnt work as advertised.,Ellie
6,4,cool product,"This worked well for me,Not 5 stars because t...",DAVE
7,5,BEST THING EVER,"Go and buy this right now, it's amazing.",Pablo
8,5,Amazing product,"Love everything about this product,it works g...",John
9,5,Love this!!,Iwould 100% recommend this to everone.,CARA
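
For the other drop option, removing only the missing values rather than a whole column, one possible approach is sketched below. It uses a made-up miniature table and assumes that an empty string is how a missing review shows up; neither comes from the original dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table; "" stands in for a review that was never written.
sample = pd.DataFrame({
    "Rating": [4, 3, 5],
    "Review": ["Works fine", "", "You should buy this."],
    "Customer Name": ["Phillip", "elena", "Olivia"],
})

# Mark empty strings as NaN so pandas recognises them as missing.
marked = sample.copy()
marked["Review"] = marked["Review"].replace("", np.nan)

# Drop only the rows that lack a Review; all columns are kept.
rows_kept = marked.dropna(subset=["Review"])

print(rows_kept)
```

Here `dropna(subset=...)` limits the check to the named column, so rows with gaps elsewhere are not removed.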


Now, let’s examine our other option.

**Fill in (impute) missing data**

Technically, the method described above of marking individual values as missing by pandas or NumPy conventions is also a way of handling missing data. When it comes to filling the gaps, you can either insert a placeholder such as ‘No review’ using the code below, or manually fill in the correct data.



In [77]:
data['Review'] = data['Review'].fillna('No review')
data

Unnamed: 0,Rating,Review Title,Review,Customer Name
0,4,Works well,"\nThe product works fine, it is maybe a little...",Phillip
1,3,good enough,No review,elena
2,5,Everyone should buy this,You should buy this.,Olivia
3,5,Amazing product,"Love everything about this product,it works g...",John
4,1,this is terrible,The product never worked for me.,Paula
5,2,Doesnt work,This doesnt work as advertised.,Ellie
6,4,cool product,"This worked well for me,Not 5 stars because t...",DAVE
7,5,BEST THING EVER,"Go and buy this right now, it's amazing.",Pablo
8,5,Amazing product,"Love everything about this product,it works g...",John
9,5,Love this!!,Iwould 100% recommend this to everone.,CARA
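
When more than one column needs a placeholder, `fillna` also accepts a dictionary that maps column names to fill values. This is a small sketch on a made-up table; the `'unknown'` placeholder is an assumption for illustration, not a value used in the walkthrough above.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table with gaps in two columns.
sample = pd.DataFrame({
    "Review": ["Works fine", np.nan],
    "Review ID": ["#7653", np.nan],
})

# Each key is a column name; each value is that column's placeholder.
filled = sample.fillna({"Review": "No review", "Review ID": "unknown"})

print(filled)
```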


**Check for Duplicates**

Duplicates, like missing data, cause problems and clog up analytics software. Let’s locate and eliminate them.

To locate duplicates we start out with:

In [78]:
data.duplicated()


0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8      True
9     False
10    False
11    False
dtype: bool

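The result is a list of boolean values, where a ‘True’ reading indicates a row that duplicates an earlier one.

Let’s go ahead and get rid of that duplicate (datapoint 8). Note that `drop_duplicates()` returns a new DataFrame rather than changing `data` itself, so assign the result back (`data = data.drop_duplicates()`) if you want the removal to stick.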

In [79]:
data.drop_duplicates()


Unnamed: 0,Rating,Review Title,Review,Customer Name
0,4,Works well,"\nThe product works fine, it is maybe a little...",Phillip
1,3,good enough,No review,elena
2,5,Everyone should buy this,You should buy this.,Olivia
3,5,Amazing product,"Love everything about this product,it works g...",John
4,1,this is terrible,The product never worked for me.,Paula
5,2,Doesnt work,This doesnt work as advertised.,Ellie
6,4,cool product,"This worked well for me,Not 5 stars because t...",DAVE
7,5,BEST THING EVER,"Go and buy this right now, it's amazing.",Pablo
9,5,Love this!!,Iwould 100% recommend this to everone.,CARA
10,100,Hate this,This doesnt do anylhing for me.,Helen
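
A minimal sketch of making the removal persistent, using a made-up three-row table rather than the real dataset. Resetting the index afterwards is optional and shown only as one possible choice.

```python
import pandas as pd

# Hypothetical miniature table; the third row repeats the first.
sample = pd.DataFrame({
    "Rating": [5, 1, 5],
    "Review Title": ["Amazing product", "this is terrible", "Amazing product"],
    "Customer Name": ["John", "Paula", "John"],
})

# duplicated() marks every repeat after the first occurrence.
dupe_mask = sample.duplicated()

# drop_duplicates() returns a new frame, so keep the result in a variable.
deduped = sample.drop_duplicates().reset_index(drop=True)

print(dupe_mask.tolist())
print(deduped)
```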


**Detect Outliers**

Outliers are numerical values that lie significantly outside of the statistical norm. In plain terms, they are data points so far out of range that they are likely errors.

Like duplicates, they need to be dealt with. Let’s sniff out an outlier by first pulling up summary statistics for the Rating column.

In [80]:
data['Rating'].describe()


count     12.000000
mean      11.833333
std       27.797427
min        1.000000
25%        3.000000
50%        4.500000
75%        5.000000
max      100.000000
Name: Rating, dtype: float64

Take a look at that ‘max’ value: none of the other values are even close to 100, and it pulls the mean (the average) up to about 11.8. Now, your solution to outliers will depend on your knowledge of your dataset. In this case, the people who entered the data know that they meant to put a value of 1, not 100. So, we can safely correct the outlier to fix our data.

In [82]:
data.loc[10,'Rating'] = 1
data

Unnamed: 0,Rating,Review Title,Review,Customer Name
0,4,Works well,"\nThe product works fine, it is maybe a little...",Phillip
1,3,good enough,No review,elena
2,5,Everyone should buy this,You should buy this.,Olivia
3,5,Amazing product,"Love everything about this product,it works g...",John
4,1,this is terrible,The product never worked for me.,Paula
5,2,Doesnt work,This doesnt work as advertised.,Ellie
6,4,cool product,"This worked well for me,Not 5 stars because t...",DAVE
7,5,BEST THING EVER,"Go and buy this right now, it's amazing.",Pablo
8,5,Amazing product,"Love everything about this product,it works g...",John
9,5,Love this!!,Iwould 100% recommend this to everone.,CARA


Now our dataset has ratings ranging from 1 to 5, which avoids the major skew a rogue 100 would have caused.
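
Spotting the outlier by eye works on a small table, but it can also be flagged in code. The sketch below shows two common checks on a made-up series: a range check that assumes ratings must lie between 1 and 5, and the 1.5 × IQR rule of thumb. Both thresholds are assumptions, not part of the walkthrough above.

```python
import pandas as pd

# Hypothetical ratings, including one rogue value.
ratings = pd.Series([4, 3, 5, 1, 100], name="Rating")

# Check 1: anything outside the assumed 1-5 scale is flagged.
out_of_range = ratings[~ratings.between(1, 5)]

# Check 2: the 1.5 * IQR rule of thumb for statistical outliers.
q1 = ratings.quantile(0.25)
q3 = ratings.quantile(0.75)
iqr = q3 - q1
iqr_outliers = ratings[(ratings < q1 - 1.5 * iqr) | (ratings > q3 + 1.5 * iqr)]

print(out_of_range)
print(iqr_outliers)
```

The range check relies on knowing the valid scale; the IQR rule needs no such knowledge but can flag legitimate values in skewed data, so flagged rows are best reviewed before they are changed.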



**Normalize Casing**

Last but not least, we are going to dot our i’s and cross our t’s. That means standardizing (lowercasing) all review titles so that differences in casing don’t confuse our algorithms, and capitalizing Customer Names so that the same name always appears in the same form (you’ll see this in action below).

Here’s how to make every review title lowercase:



In [83]:
data['Review Title'] = data['Review Title'].str.lower()
data

Unnamed: 0,Rating,Review Title,Review,Customer Name
0,4,works well,"\nThe product works fine, it is maybe a little...",Phillip
1,3,good enough,No review,elena
2,5,everyone should buy this,You should buy this.,Olivia
3,5,amazing product,"Love everything about this product,it works g...",John
4,1,this is terrible,The product never worked for me.,Paula
5,2,doesnt work,This doesnt work as advertised.,Ellie
6,4,cool product,"This worked well for me,Not 5 stars because t...",DAVE
7,5,best thing ever,"Go and buy this right now, it's amazing.",Pablo
8,5,amazing product,"Love everything about this product,it works g...",John
9,5,love this!!,Iwould 100% recommend this to everone.,CARA


Looks great! Now let’s make sure our high-powered programs don’t get tripped up and miscategorize a customer name because of inconsistent capitalization. Here’s how to capitalize every Customer Name:

In [84]:
data['Customer Name'] = data['Customer Name'].str.title()
data

Unnamed: 0,Rating,Review Title,Review,Customer Name
0,4,works well,"\nThe product works fine, it is maybe a little...",Phillip
1,3,good enough,No review,Elena
2,5,everyone should buy this,You should buy this.,Olivia
3,5,amazing product,"Love everything about this product,it works g...",John
4,1,this is terrible,The product never worked for me.,Paula
5,2,doesnt work,This doesnt work as advertised.,Ellie
6,4,cool product,"This worked well for me,Not 5 stars because t...",Dave
7,5,best thing ever,"Go and buy this right now, it's amazing.",Pablo
8,5,amazing product,"Love everything about this product,it works g...",John
9,5,love this!!,Iwould 100% recommend this to everone.,Cara
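
Casing is not the only source of inconsistent text: the first review in the table starts with a stray `\n`. A minimal sketch of trimming such whitespace with `str.strip()`, on a made-up miniature table, chained with the same `str.title()` call used above:

```python
import pandas as pd

# Hypothetical miniature table with stray whitespace and mixed casing.
sample = pd.DataFrame({
    "Review": ["\nThe product works fine", "You should buy this.  "],
    "Customer Name": [" elena", "DAVE "],
})

# Remove leading and trailing whitespace, including newlines.
sample["Review"] = sample["Review"].str.strip()

# Trim first, then title-case, so names end up in one consistent form.
sample["Customer Name"] = sample["Customer Name"].str.strip().str.title()

print(sample)
```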


And there you have it: our data set with all the fixins’. Or, rather, with all the fix-outs: we’ve made good use of intuitive Python libraries to locate and eliminate bad data, and to standardize the rest. We are now ready to make the most of the data in our AI and ML models.