# Ex2 - Getting and Knowing your Data

This time we are going to pull data directly from the internet.
Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [2]:
import pandas as pd

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv). 

### Step 3. Assign it to a variable called chipo.

In [3]:
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv'
chipo = pd.read_csv(url, sep='\t')
#chipo = pd.read_csv('data/chipotle_small.tsv', sep='\t')
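A TSV file is a CSV whose separator is a tab, which is why `sep='\t'` is passed above. The following is a minimal offline sketch of the same call. It assumes a made-up two-row sample (not the real file) and uses `io.StringIO` in place of the URL; the names `sample_tsv` and `sample` are illustrative only.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample standing in for the remote file
sample_tsv = (
    "order_id\tquantity\titem_name\titem_price\n"
    "1\t1\tIzze\t$3.39\n"
    "2\t2\tChicken Bowl\t$16.98\n"
)

# read_csv accepts any file-like object; sep="\t" splits fields on tabs
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
print(sample.shape)
```

Because of the leading `$`, the price column is read as text, which is the situation Step 13 deals with.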

### Step 4. See the first 10 entries

In [4]:
chipo.head(10)

,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98
5,3,1,Chicken Bowl,"[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...",$10.98
6,3,1,Side of Chips,,$1.69
7,4,1,Steak Burrito,"[Tomatillo Red Chili Salsa, [Fajita Vegetables...",$11.75
8,4,1,Steak Soft Tacos,"[Tomatillo Green Chili Salsa, [Pinto Beans, Ch...",$9.25
9,5,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Black Beans, Pinto...",$9.25


### Step 5. What is the number of observations in the dataset?

In [5]:
# Solution 1
numrows, numcolumns = chipo.shape
numrows

4622

In [6]:
# Solution 2
# info() must be called with parentheses; without them the cell only echoes the bound method.
# The summary starts with the index range, i.e. "RangeIndex: 4622 entries, 0 to 4621".
chipo.info()
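There are several ways to count observations, and `info` only produces its summary when it is actually called. The sketch below uses a hypothetical toy frame (`toy` is not the Chipotle data) and relies on the `buf` parameter of `DataFrame.info` to capture the summary as a string.

```python
import io
import pandas as pd

# Hypothetical toy frame, not the Chipotle data
toy = pd.DataFrame({"order_id": [1, 1, 2], "quantity": [1, 2, 1]})

n_rows = len(toy)           # number of observations
n_rows_alt = toy.shape[0]   # same value, taken from the shape tuple

# info() prints to stdout by default; buf redirects the text into a buffer
buffer = io.StringIO()
toy.info(buf=buffer)
summary = buffer.getvalue()
print(summary)
```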

### Step 6. What is the number of columns in the dataset?

In [7]:
numcolumns

5

### Step 7. Print the name of all the columns.

In [8]:
chipo.columns

Index(['order_id', 'quantity', 'item_name', 'choice_description',
       'item_price'],
      dtype='object')

### Step 8. How is the dataset indexed?

In [9]:
# Indexed by row order, numbered 0, 1, 2, ...
chipo.index

RangeIndex(start=0, stop=4622, step=1)
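One consequence of the default `RangeIndex` is that row labels are kept when rows are filtered, so a filtered frame no longer counts 0, 1, 2, ... A small sketch on a hypothetical toy frame (the names `toy`, `big` and `renumbered` are illustrative):

```python
import pandas as pd

# Hypothetical toy frame with the default RangeIndex 0..3
toy = pd.DataFrame({"quantity": [1, 4, 2, 5]})

# Filtering keeps the original labels (here 1 and 3)
big = toy[toy["quantity"] >= 4]

# reset_index(drop=True) assigns a fresh 0..n-1 index
renumbered = big.reset_index(drop=True)

print(list(big.index), list(renumbered.index))
```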

### Step 9. Which was the most-ordered item? 

In [10]:
df_most_ordered = chipo[['item_name', 'quantity']]
df_most_ordered_grp = df_most_ordered.groupby(['item_name']).sum()
df_most_ordered_grp.sort_values('quantity', ascending=False, inplace=True)
df_most_ordered_grp.index[0]
# most ordered item is:

'Chicken Bowl'

### Step 10. For the most-ordered item, how many items were ordered?

In [11]:
df_most_ordered_grp.loc[df_most_ordered_grp.index[0]]
# total quantity of most ordered item is:

quantity    761
Name: Chicken Bowl, dtype: int64
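Steps 9 and 10 can also be answered without sorting. This is one possible alternative, shown on a hypothetical toy frame rather than the real data: `idxmax` returns the label of the largest total and `max` returns the total itself.

```python
import pandas as pd

# Hypothetical toy orders, not the Chipotle data
toy = pd.DataFrame({
    "item_name": ["Chicken Bowl", "Izze", "Chicken Bowl", "Izze", "Chips"],
    "quantity": [2, 1, 1, 1, 1],
})

# Total quantity per item, as a Series indexed by item_name
totals = toy.groupby("item_name")["quantity"].sum()

top_item = totals.idxmax()    # label with the largest total
top_quantity = totals.max()   # the largest total itself

print(top_item, top_quantity)
```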

### Step 11. What was the most ordered item in the choice_description column?

In [12]:
choice_desc_quantity = chipo[['choice_description', 'quantity']]
choice_desc_quantity_grp = choice_desc_quantity.groupby('choice_description').sum()
choice_desc_quantity_grp.sort_values('quantity', ascending=False, inplace=True)
choice_desc_quantity_grp.index[0]

'[Diet Coke]'
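Note that `groupby` drops rows whose key is missing by default, so items with no `choice_description` (shown as NaN in Step 4) do not compete for the top spot here. The sketch below illustrates that behaviour on hypothetical toy data; the `dropna=False` argument assumes pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing descriptions
toy = pd.DataFrame({
    "choice_description": ["[Diet Coke]", np.nan, "[Sprite]", np.nan, "[Diet Coke]"],
    "quantity": [1, 5, 1, 5, 2],
})

# Default: rows with a NaN key are left out of the groups
by_choice = toy.groupby("choice_description")["quantity"].sum()

# dropna=False (pandas >= 1.1) keeps NaN as its own group
with_missing = toy.groupby("choice_description", dropna=False)["quantity"].sum()

print(by_choice)
print(with_missing)
```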

### Step 12. How many items were ordered in total?

In [13]:
chipo['quantity'].sum()

4972

### Step 13. Turn the item price into a float

#### Step 13.a. Check the item price type

In [14]:
chipo['item_price'].dtype

dtype('O')

#### Step 13.b. Create a lambda function and change the type of item price

In [15]:
chipo['item_price'] = chipo['item_price'].apply(lambda x: float(x[1:]))
chipo['item_price']

0        2.39
1        3.39
2        3.39
3        2.39
4       16.98
        ...  
4617    11.75
4618    11.75
4619    11.25
4620     8.75
4621     8.75
Name: item_price, Length: 4622, dtype: float64

#### Step 13.c. Check the item price type

In [16]:
chipo['item_price'].dtype

dtype('float64')
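The lambda in Step 13.b works only while the column still holds strings, so re-running that cell after the conversion would fail. As an alternative sketch, the same conversion can be written with vectorized string methods; `prices` below is a hypothetical sample, not the real column.

```python
import pandas as pd

# Hypothetical sample of price strings
prices = pd.Series(["$2.39", "$16.98", "$9.25"])

# Strip the leading dollar sign, then cast the remaining text to float
as_float = prices.str.lstrip("$").astype(float)

print(as_float.dtype)
```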

### Step 14. How much was the revenue for the period in the dataset?

In [17]:
# Note: the item price already accounts for the quantity, i.e. item_price = quantity x unit price
# For example, compare a quantity-1 and a quantity-2 row for [Diet Coke]
filt = (chipo['choice_description'] == '[Diet Coke]')
by_quantity = chipo[filt].set_index('quantity')
# DataFrame.append was removed in pandas 2.0; pd.concat builds the same two-row frame
df_example = pd.concat([by_quantity.loc[[1]].head(1), by_quantity.loc[[2]].head(1)])
df_example

quantity,order_id,item_name,choice_description,item_price
1,89,Canned Soda,[Diet Coke],1.09
2,73,Canned Soda,[Diet Coke],2.18


In [18]:
# The example above suggests that item_price already includes the quantity, so revenue is the sum of the item_price column
chipo['item_price'].sum()

34500.16
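The claim that `item_price` already includes the quantity can be checked more broadly than with a single pair of rows. One possible check, sketched here on the two Diet Coke rows shown above (the column name `unit_price` is introduced only for illustration), derives a unit price and confirms it is the same within each description.

```python
import pandas as pd

# The two [Diet Coke] rows from the example above
toy = pd.DataFrame({
    "choice_description": ["[Diet Coke]", "[Diet Coke]"],
    "quantity": [1, 2],
    "item_price": [1.09, 2.18],
})

# Implied price of a single unit, rounded to cents
toy["unit_price"] = (toy["item_price"] / toy["quantity"]).round(2)

# True when every description has exactly one distinct unit price
same_unit_price = toy.groupby("choice_description")["unit_price"].nunique().eq(1).all()

# Revenue is then the plain sum of item_price
revenue = toy["item_price"].sum()

print(same_unit_price, revenue)
```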

### Step 15. How many orders were made in the period?

In [19]:
# nunique() counts distinct order IDs; max() agrees only if IDs run from 1 to N without gaps
chipo['order_id'].nunique()

1834
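Taking the largest `order_id` gives the number of orders only when the IDs are consecutive and start at 1. The sketch below uses hypothetical IDs with gaps to show how the maximum and the distinct count can differ.

```python
import pandas as pd

# Hypothetical order IDs with gaps (3, 4 and 6 are missing)
order_ids = pd.Series([1, 1, 2, 5, 5, 7])

largest_id = order_ids.max()       # 7, not a count of orders
n_orders = order_ids.nunique()     # 4 distinct orders

print(largest_id, n_orders)
```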

### Step 16. What is the average revenue amount per order?

In [20]:
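# Total price and total quantity per order
order_q_price = chipo[['order_id', 'quantity', 'item_price']]
order_q_price_grp = order_q_price.groupby(['order_id']).sum()
# Note: this ratio is the average price per item within each order.
# The average revenue per order would be order_q_price_grp['item_price'].mean()
mean_price_per_order = order_q_price_grp['item_price']/order_q_price_grp['quantity']

In [21]:
# Average price per item, by order
mean_price_per_order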

order_id
1        2.890000
2        8.490000
3        6.335000
4       10.500000
5        6.850000
          ...    
1830    11.500000
1831     4.300000
1832     6.600000
1833    11.750000
1834     9.583333
Length: 1834, dtype: float64
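The series above is the average price per item within each order. The average revenue per order, which is what the step asks for, is the mean of the per-order totals. A minimal sketch using only the rows for orders 1 to 3 from the Step 4 preview (so the printed value is not the answer for the full dataset):

```python
import pandas as pd

# Rows for orders 1-3, copied from the head(10) output in Step 4
first_orders = pd.DataFrame({
    "order_id": [1, 1, 1, 1, 2, 3, 3],
    "quantity": [1, 1, 1, 1, 2, 1, 1],
    "item_price": [2.39, 3.39, 3.39, 2.39, 16.98, 10.98, 1.69],
})

# Sum item_price within each order, then average those totals
revenue_per_order = first_orders.groupby("order_id")["item_price"].sum()
avg_revenue_per_order = revenue_per_order.mean()

print(revenue_per_order)
print(avg_revenue_per_order)
```

On the full dataset the same quantity equals total revenue divided by the number of orders, which with the figures from Steps 14 and 15 is 34500.16 / 1834, or about 18.81.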

### Step 17. How many different items are sold?

In [22]:
chipo['item_name'].nunique()

50

In [23]:
order_q_price

,order_id,quantity,item_price
0,1,1,2.39
1,1,1,3.39
2,1,1,3.39
3,1,1,2.39
4,2,2,16.98
...,...,...,...
4617,1833,1,11.75
4618,1833,1,11.75
4619,1834,1,11.25
4620,1834,1,8.75


In [24]:
filt = (order_q_price['order_id'] == 2)
order_q_price[filt]

,order_id,quantity,item_price
4,2,2,16.98


In [29]:
order_q_price[order_q_price['quantity'] >= 2]

,order_id,quantity,item_price
4,2,2,16.98
18,9,2,2.18
51,23,2,2.18
135,60,2,22.50
148,67,2,17.98
...,...,...,...
4491,1786,4,5.00
4499,1789,2,2.50
4560,1812,2,2.50
4561,1813,2,17.50


In [30]:
order_q_price.loc[order_q_price['quantity'] >= 4, 'item_price']

1254    35.00
1257    11.80
1425     6.00
1880     6.00
2235     4.36
2441     7.50
3598    44.25
3599    10.50
3602    35.00
3887    13.52
3973     5.00
4152    15.00
4489    17.80
4490     5.00
4491     5.00
Name: item_price, dtype: float64
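The last cells filter rows with a boolean mask and, in the `.loc` form, pick columns at the same time. As a sketch on hypothetical toy data, conditions can be combined with `&`, provided each one is wrapped in parentheses.

```python
import pandas as pd

# Hypothetical toy frame, not the Chipotle data
toy = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "quantity": [1, 2, 4, 5],
    "item_price": [2.39, 16.98, 35.00, 6.00],
})

# Each condition needs its own parentheses because & binds tighter than >=
mask = (toy["quantity"] >= 4) & (toy["item_price"] > 10)

# .loc takes the row mask first, then the columns to keep
pricey_bulk = toy.loc[mask, ["order_id", "item_price"]]

print(pricey_bulk)
```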