# Cleaning Data

**Prerequisites**

- [Intro](intro.ipynb)  
- [Boolean selection](basics.ipynb)  
- [Indexing](the_index.ipynb)  


**Outcomes**

- Be able to use string methods to clean data that comes as a string  
- Be able to drop missing data  
- Use cleaning methods to prepare and analyze a real dataset  


**Data**

- Item information from about 3,000 Chipotle meals from about 1,800
  Grubhub orders  

In [6]:
# Uncomment following line to install on colab
#! pip install qeds

In [7]:
import pandas as pd
import numpy as np
import qeds

In [8]:
df = pd.DataFrame({"numbers": ["#23", "#24", "#18", "#14", "#12", "#10", "#35"],
                   "nums": ["23", "24", "18", "14", np.nan, "XYZ", "35"],
                   "colors": ["green", "red", "yellow", "orange", "purple", "blue", "pink"],
                   "other_column": [0, 1, 0, 2, 1, 0, 2]})
df

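| | numbers | nums | colors | other_column |
|---|---|---|---|---|
| 0 | #23 | 23 | green | 0 |
| 1 | #24 | 24 | red | 1 |
| 2 | #18 | 18 | yellow | 0 |
| 3 | #14 | 14 | orange | 2 |
| 4 | #12 | NaN | purple | 1 |
| 5 | #10 | XYZ | blue | 0 |
| 6 | #35 | 35 | pink | 2 |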
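The `numbers` column holds strings such as `"#23"`, so it cannot be used in arithmetic as it stands. The following is a minimal sketch of one way to clean it with pandas string methods. The names `demo`, `numbers_str` and `numbers_numeric` are chosen here for illustration, and `demo` is a separate frame so that `df` above is left untouched.

```python
import pandas as pd

# A small stand-alone copy of the "numbers" column from the DataFrame above
demo = pd.DataFrame({"numbers": ["#23", "#24", "#18", "#14", "#12", "#10", "#35"]})

# Remove the literal "#" from every element using the vectorized .str accessor
numbers_str = demo["numbers"].str.replace("#", "", regex=False)

# The result is still a column of strings, so convert it to a numeric dtype
numbers_numeric = pd.to_numeric(numbers_str)

print(numbers_str.dtype)         # object (strings)
print(numbers_numeric.tolist())  # [23, 24, 18, 14, 12, 10, 35]
```

Working through the `.str` accessor avoids writing an explicit Python loop over the rows.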



<a id='exercise-1'></a>
> See exercise 2 in the [*exercise list*](#exerciselist-0)


<a id='exercise-2'></a>
> See exercise 3 in the [*exercise list*](#exerciselist-0)

In [26]:
chipotle = qeds.data.load("chipotle_raw")
chipotle.head()

| | order_id | quantity | item_name | choice_description | item_price |
|---|---|---|---|---|---|
| 0 | 1 | 1 | Chips and Fresh Tomato Salsa | NaN | $2.39 |
| 1 | 1 | 1 | Izze | [Clementine] | $3.39 |
| 2 | 1 | 1 | Nantucket Nectar | [Apple] | $3.39 |
| 3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | NaN | $2.39 |
| 4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $16.98 |



<a id='exercise-3'></a>
> See exercise 4 in the [*exercise list*](#exerciselist-0)

## Exercises


<a id='exerciselist-0'></a>
**Exercise 1**

Convert the string below into a number.

In [32]:
c2n = "#39" 
int(c2n.replace("#", ""))  # remove the non-numeric "#" character, then convert the remaining digits to int

39

([*back to text*](#exercise-0))

**Exercise 2**

Make a new column called `colors_upper` that contains the elements of
`colors` with all uppercase letters.

([*back to text*](#exercise-1))

**Exercise 3**

Convert the column `"nums"` to a numeric type using `pd.to_numeric` and
save it to the DataFrame as `"nums_tonumeric"`.

Notice that there is a missing value, and a value that is not a number.

Look at the documentation for `pd.to_numeric` and think about how to
overcome this.

Think about why this could be a bad idea if used without
knowing what your data looks like. (Think about what happens when you
apply it to the `"numbers"` column before replacing the `"#"`.)

([*back to text*](#exercise-2))

**Exercise 4**

We'd like you to use this data to answer the following questions.

- What is the average price of an item with chicken?  
- What is the average price of an item with steak?  
- Did chicken or steak produce more revenue (total)?  
- How many missing items are there in this dataset? How many missing
  items in each column?  


Hint: before you can do any of these things, you will need to
make sure the `item_price` column has a numeric `dtype` (probably
float).

([*back to text*](#exercise-3))

In [33]:
#EX2
df['colors_upper'] = df["colors"].str.upper()  # new column holding the colors in uppercase
print(df)  # check that the column was created correctly

  numbers nums  colors  other_column  numbers_loop numbers_str  \
0     #23   23   green             0          23.0          23   
1     #24   24     red             1          24.0          24   
2     #18   18  yellow             0          18.0          18   
3     #14   14  orange             2          14.0          14   
4     #12  NaN  purple             1          12.0          12   
5     #10  XYZ    blue             0          10.0          10   
6     #35   35    pink             2          35.0          35   

   numbers_numeric colors_upper  
0               23        GREEN  
1               24          RED  
2               18       YELLOW  
3               14       ORANGE  
4               12       PURPLE  
5               10         BLUE  
6               35         PINK  


In [34]:
#EX3
df['nums_tonumeric'] = pd.to_numeric(df['nums'], errors='coerce')  # errors='coerce' turns values that cannot be parsed into NaN
print(df)

  numbers nums  colors  other_column  numbers_loop numbers_str  \
0     #23   23   green             0          23.0          23   
1     #24   24     red             1          24.0          24   
2     #18   18  yellow             0          18.0          18   
3     #14   14  orange             2          14.0          14   
4     #12  NaN  purple             1          12.0          12   
5     #10  XYZ    blue             0          10.0          10   
6     #35   35    pink             2          35.0          35   

   numbers_numeric colors_upper  nums_tonumeric  
0               23        GREEN            23.0  
1               24          RED            24.0  
2               18       YELLOW            18.0  
3               14       ORANGE            14.0  
4               12       PURPLE             NaN  
5               10         BLUE             NaN  
6               35         PINK            35.0  
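Exercise 3 also asks why `errors='coerce'` can be a bad idea when you do not know what your data looks like. The sketch below uses a small hypothetical frame named `demo`, modeled on the columns above, to show that coercion silently replaces every unparseable value with `NaN`. The `newly_missing` check is one possible safeguard, not a built-in pandas feature.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the DataFrame used in this lecture
demo = pd.DataFrame({"numbers": ["#23", "#24", "#18"],
                     "nums": ["23", np.nan, "XYZ"]})

# Coercing "nums": "XYZ" becomes NaN, joining the value that was already missing
coerced_nums = pd.to_numeric(demo["nums"], errors="coerce")

# Coercing "numbers" before removing "#": no element parses, so all become NaN
coerced_numbers = pd.to_numeric(demo["numbers"], errors="coerce")

# Safeguard: count values that were present before coercion but are NaN after it
newly_missing = coerced_numbers.isna() & demo["numbers"].notna()

print(coerced_nums.isna().sum())     # 2
print(coerced_numbers.isna().sum())  # 3
print(newly_missing.sum())           # 3
```

No error or warning is raised in either case, so an entire column of usable data can vanish unnoticed unless you compare the missing counts before and after.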


In [35]:
#EX4
chipotle = qeds.data.load("chipotle_raw")
# strip "$" and "," from item_price, then convert to float (the raw string avoids an invalid-escape warning)
chipotle["item_price"] = chipotle["item_price"].replace(r'[\$,]', '', regex=True).astype(float)
df1 = chipotle[chipotle['item_name'].str.contains("Chicken")]  # rows whose item name mentions chicken
print('The average price of an item with chicken is:')
df1["item_price"].mean()

The average price of an item with chicken is:


10.133724358974309

In [36]:
df2=chipotle[chipotle['item_name'].str.contains("Steak")]
print('The average price of an item with steak is:')
df2["item_price"].mean() 

The average price of an item with steak is:


10.518888888888851

In [37]:
print(df1["item_price"].sum()) #chicken
print(df2["item_price"].sum()) #steak
print('We can see that chicken produced more revenue')

15808.61
7384.26
We can see that chicken produced more revenue
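The averages above are means over rows of the data. In the first rows shown by `chipotle.head()`, the Chicken Bowl row has `quantity` 2 and `item_price` $16.98, which suggests that `item_price` may be the total for the row rather than the price of a single item. Under that assumption, which the source does not confirm, a per-unit average can be sketched as follows. The frame `sample` is rebuilt by hand from those five rows and `unit_price` is a name introduced here.

```python
import pandas as pd

# Hand-built copy of the five rows shown by chipotle.head() (relevant columns only)
sample = pd.DataFrame({
    "quantity": [1, 1, 1, 1, 2],
    "item_name": ["Chips and Fresh Tomato Salsa", "Izze", "Nantucket Nectar",
                  "Chips and Tomatillo-Green Chili Salsa", "Chicken Bowl"],
    "item_price": ["$2.39", "$3.39", "$3.39", "$2.39", "$16.98"],
})

# Clean the price strings and convert them to float
sample["item_price"] = (sample["item_price"]
                        .str.replace(r"[\$,]", "", regex=True)
                        .astype(float))

# ASSUMPTION: item_price is the total for the row, so divide by quantity
sample["unit_price"] = sample["item_price"] / sample["quantity"]

chicken = sample[sample["item_name"].str.contains("Chicken")]

per_row_mean = chicken["item_price"].mean()                              # mean over rows
per_unit_mean = chicken["item_price"].sum() / chicken["quantity"].sum()  # mean over units sold

print(per_row_mean, per_unit_mean)
```

The revenue comparison is unaffected by this choice, because summing `item_price` gives total revenue under either reading of "average price".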


In [38]:
print("\nNumber of missing values in each column of the Chipotle data:\n")
print(chipotle.isnull().sum())
print('1246 missing items total; all in the choice_description column')


Number of missing values in each column of the Chipotle data:

order_id                 0
quantity                 0
item_name                0
choice_description    1246
item_price               0
dtype: int64
1246 missing items total; all in the choice_description column
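The total above was read off the per-column output and typed by hand. As a small sketch, the total can also be computed directly, and rows with missing data can be dropped with `dropna`. The frame `sample` below is a hand-built toy with three of the rows shown earlier, so its counts are not those of the full dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing choice_description
sample = pd.DataFrame({
    "item_name": ["Chips and Fresh Tomato Salsa", "Izze", "Nantucket Nectar"],
    "choice_description": [np.nan, "[Clementine]", "[Apple]"],
})

# Missing values per column, then summed across columns for the overall total
missing_per_column = sample.isna().sum()
total_missing = int(missing_per_column.sum())

# Drop every row that contains at least one missing value
complete_rows = sample.dropna()

print(missing_per_column)
print(f"{total_missing} missing items total")
print(len(complete_rows), "rows remain after dropna")
```

Computing the total keeps the printed message correct if the data changes.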
