# Data cleaning challenge: which product do people like best?

In this challenge, you will take the role of a data scientist. You'll be given customer review data for three products (Products A, B, and C), and you'll have to clean it so that your company's graphing code can run and show which product is best.

### Necessary files:
* There is a file in the `datasets` folder called `product_tests.csv`. It contains 100 customer ratings for each of Products A, B, and C. Each customer has a unique user id and rated one of the products on a scale from 0 to 5 (0 is the worst, 5 is the best).
* There is a script that runs your company's graphing code called `compare_products.py`. This script will make a graph to help figure out which product customers like best. **This script reads in a file called `products_clean.csv` in the `datasets` folder. Your overall job is to clean the data to make this file!**


**First, import the `product_tests.csv` file using pandas and assign it to a variable** (remember to import pandas too)

In [3]:
import pandas as pd

In [9]:
product_test = pd.read_csv('../../datasets/product_tests.csv')
product_test

,Unnamed: 0,product,rating,user_id
0,0,Product A,4.340998,Y5JgC1
1,1,Product A,,GRHQYF
2,2,Product A,2.363216,EZ96Fa
3,3,Product A,,MzRCo4
4,4,Product A,4.987896,VnVWvM
...,...,...,...,...
295,95,Product C,4.332348,IkyryZ
296,96,Product C,4.531547,
297,97,Product C,3.733014,shIkm7
298,98,Product C,,4UFkhB


In [10]:
product_test.describe()

,Unnamed: 0,rating
count,300.0,294.0
mean,49.5,1.571008
std,28.914301,17.700559
min,0.0,-300.0
25%,24.75,1.528339
50%,49.5,2.531201
75%,74.25,3.630419
max,99.0,4.996956
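
The summary above already hints at two problems: the minimum rating is -300, far outside the 0-5 scale, and only 294 of the 300 rows have a rating at all. A minimal sketch of how such checks could be made explicit, using a small made-up DataFrame (`toy` and its values are illustrative, not the real data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    'product': ['Product A', 'Product A', 'Product B', None],
    'rating': [4.3, np.nan, -300.0, 2.5],
    'user_id': ['Y5JgC1', 'GRHQYF', None, 'shIkm7'],
})

# Count missing values per column.
missing_per_column = toy.isnull().sum()

# Flag ratings outside the valid 0-5 range. Missing ratings are not
# flagged here, because comparisons with NaN evaluate to False.
out_of_range = (toy['rating'] < 0) | (toy['rating'] > 5)

print(missing_per_column)
print('out-of-range ratings:', out_of_range.sum())
```

Running the same two checks on `product_test` should show how many rows each cleaning step needs to handle.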


### Your data cleaning goals:

Your goal is to turn the raw data into a cleaned data file, `products_clean.csv`. Take the following steps to make sure the data are clean:

1. Remove any rows where ratings (values in the `rating` column) are below 0 or above 5. These scores are impossible, so the rows should be removed.

In [15]:
# Keep only ratings in the valid 0-5 range (this also drops missing ratings)
good_ratings = product_test[product_test['rating'].between(0, 5)]
good_ratings

,Unnamed: 0,product,rating,user_id
0,0,Product A,4.340998,Y5JgC1
2,2,Product A,2.363216,EZ96Fa
4,4,Product A,4.987896,VnVWvM
5,5,Product A,0.256108,uyTYq1
6,6,Product A,0.254752,6hiPYk
...,...,...,...,...
294,94,Product C,2.183499,C3cTCd
295,95,Product C,4.332348,IkyryZ
296,96,Product C,4.531547,
297,97,Product C,3.733014,shIkm7


In [16]:
good_ratings.describe()

,Unnamed: 0,rating
count,293.0,293.0
mean,49.692833,2.60026
std,28.740833,1.365305
min,0.0,0.021793
25%,25.0,1.561531
50%,49.0,2.536626
75%,75.0,3.636347
max,99.0,4.996956
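
Note that the filtered data has 293 rows, while the raw data had 294 non-missing ratings out of 300 rows. Any comparison against a missing value (NaN) is False, so a range filter removes rows with a missing rating along with the impossible scores. A small sketch on made-up values (the `toy` frame is illustrative only), which also shows why checking just the lower bound is not enough in general:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'rating': [4.3, np.nan, -300.0, 2.5, 7.0]})

# Lower bound only: drops -300 and NaN, but keeps the impossible 7.0.
lower_only = toy[toy['rating'] >= 0]

# Both bounds (inclusive by default): drops -300, 7.0, and NaN.
in_range = toy[toy['rating'].between(0, 5)]

print(len(lower_only), len(in_range))  # prints: 3 2
```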


2. There are some rows where the user_id is missing. Replace these with the string 'unknown user' for each missing user_id. We don't know the user id, but maybe we can still analyze these data points!

In [17]:
good_ratings.isnull().sum()

Unnamed: 0    0
product       1
rating        0
user_id       3
dtype: int64

In [20]:
# Fill missing user ids without chained indexing, which avoids SettingWithCopyWarning
good_ratings = good_ratings.fillna({'user_id': 'unknown user'})


In [21]:
good_ratings.isnull().sum()

Unnamed: 0    0
product       1
rating        0
user_id       0
dtype: int64
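
Chained indexing such as `df['col'][mask] = value` may end up assigning to a temporary copy, which is what pandas flags with `SettingWithCopyWarning`. A sketch of two alternatives on made-up data (the `toy` frame and its values are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    'rating': [4.5, 3.7, 2.2],
    'user_id': ['IkyryZ', None, 'shIkm7'],
})

# Option 1: fillna with a {column: value} dict returns a new DataFrame
# and only touches the listed column.
filled = toy.fillna({'user_id': 'unknown user'})

# Option 2: a single .loc assignment on an explicit copy.
filled_loc = toy.copy()
filled_loc.loc[filled_loc['user_id'].isnull(), 'user_id'] = 'unknown user'

print(filled['user_id'].tolist())
```

Both options leave the original `toy` frame untouched, which makes it easier to re-run a notebook cell without surprises.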

3. Filter out any rows where `product` or `rating` are missing. We can't analyze data if we don't know which product it was, or what the rating was!

In [23]:
# Drop rows where the product or the rating is missing
good_ratings = good_ratings.dropna(subset=['product', 'rating'])
good_ratings

,Unnamed: 0,product,rating,user_id
0,0,Product A,4.340998,Y5JgC1
2,2,Product A,2.363216,EZ96Fa
4,4,Product A,4.987896,VnVWvM
5,5,Product A,0.256108,uyTYq1
6,6,Product A,0.254752,6hiPYk
...,...,...,...,...
294,94,Product C,2.183499,C3cTCd
295,95,Product C,4.332348,IkyryZ
296,96,Product C,4.531547,unknown user
297,97,Product C,3.733014,shIkm7


4. Rename the `rating` column to `user_rating` and the `product` column to `product_id`. The company's code is built to use these standardized column names.

In [24]:
# Reassign instead of using inplace=True, which avoids SettingWithCopyWarning
good_ratings = good_ratings.rename(columns={'rating': 'user_rating',
                                            'product': 'product_id'})


In [25]:
good_ratings.head()

,Unnamed: 0,product_id,user_rating,user_id
0,0,Product A,4.340998,Y5JgC1
2,2,Product A,2.363216,EZ96Fa
4,4,Product A,4.987896,VnVWvM
5,5,Product A,0.256108,uyTYq1
6,6,Product A,0.254752,6hiPYk


5. Once you've done all these steps, export the data to `jtc_class_code/datasets/products_clean.csv`

Make sure that the csv is named exactly this way in your folder, because the graphing code relies on this exact file path!

In [26]:
# index=False keeps pandas from writing the row index as another unnamed column
good_ratings.to_csv('../../datasets/products_clean.csv', index=False)
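
Before running the graphing script, it can help to read the exported file back and check it. The sketch below uses a temporary directory and made-up rows instead of the real path. Which columns `compare_products.py` needs beyond `product_id` and `user_rating` is not stated here, so the checks only cover those two, and writing with `index=False` is an assumption worth verifying against the script.

```python
import os
import tempfile

import pandas as pd

# Made-up rows standing in for the cleaned data.
cleaned = pd.DataFrame({
    'product_id': ['Product A', 'Product C'],
    'user_rating': [4.34, 3.73],
    'user_id': ['Y5JgC1', 'unknown user'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'products_clean.csv')
    cleaned.to_csv(path, index=False)  # no extra index column is written
    check = pd.read_csv(path)

# The renamed columns are present, complete, and within the 0-5 range.
assert {'product_id', 'user_rating'} <= set(check.columns)
assert check[['product_id', 'user_rating']].notnull().all().all()
assert check['user_rating'].between(0, 5).all()
print(list(check.columns))
```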

### Comparing the products

Once you've finished, run:
```console 
$ python compare_products.py
``` 

from the command line, and if the code runs smoothly, you'll see a file called `product_chart.png` pop up to help you decide which product customers like best. 

Which product do you think is highest-rated?

If you don't get it on the first try, don't worry! Use the error messages you see, and take a look at your `products_clean.csv` file to see what is being output, to help guide your data cleaning process.

## Finished and got the plot? Decided which product is highest-rated? 

#### Congrats on finishing the data cleaning challenge! Data cleaning is not easy! 

So, remember to comment all your code and push this notebook to GitHub.