# Pandas

## Dataset Description

- id - unique identifier of the establishment  
- address - physical address of the establishment
- categories - categories of the establishment (e.g. "Fast food restaurant")
- city - city where the establishment is located
- cuisines - types of cuisine practiced at the establishment (e.g. "Mexican")
- dateAdded - date the entry was added (assumed to be the same as the establishment's opening date)
- dateUpdated - date the information about the establishment was last updated
- latitude - geographic latitude
- longitude - geographic longitude
- menus:
    - category - categories of food on the menu
    - currency - currency in which payment is accepted
    - dateSeen - date the menu was recorded
    - description - menu description provided by the establishment
    - name - menu name
- name - name of the establishment
- province - province (state) where the establishment is located

## Tasks

### Task 1.

Load the data and view the first three rows of the `dateAdded` column.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('05_data.csv')

In [3]:
df['dateAdded'].head(3)

0    2016-03-02T11:49:34Z
1    2016-03-02T11:49:34Z
2    2016-10-14T01:58:25Z
Name: dateAdded, dtype: object
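The `dtype: object` above means the dates were loaded as plain strings. As a minimal sketch, assuming a small in-memory sample instead of `05_data.csv` (the `csv_text` rows and the `sample` name are hypothetical), `read_csv` can be asked to parse the column while loading:

```python
import io

import pandas as pd

# Hypothetical three-row sample that mimics the `dateAdded` values shown above.
csv_text = """id,dateAdded
a,2016-03-02T11:49:34Z
b,2016-03-02T11:49:34Z
c,2016-10-14T01:58:25Z
"""

# parse_dates asks read_csv to convert the listed columns to datetimes.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=['dateAdded'])

print(sample['dateAdded'].head(3))
# Once parsed, the .dt accessor exposes date parts such as the month.
print(sample['dateAdded'].dt.month.tolist())
```

The notebook itself converts the column later with `pd.to_datetime`, which gives the same kind of result after loading.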

### Task 2. 

Check the data types of the columns in the dataset.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77260 entries, 0 to 77259
Data columns (total 16 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 77260 non-null  object 
 1   address            77260 non-null  object 
 2   categories         77260 non-null  object 
 3   city               77260 non-null  object 
 4   cuisines           38384 non-null  object 
 5   dateAdded          77260 non-null  object 
 6   dateUpdated        77260 non-null  object 
 7   latitude           55636 non-null  float64
 8   longitude          55636 non-null  float64
 9   menus.category     3729 non-null   object 
 10  menus.currency     40511 non-null  object 
 11  menus.dateSeen     77260 non-null  object 
 12  menus.description  29323 non-null  object 
 13  menus.name         77260 non-null  object 
 14  name               77257 non-null  object 
 15  province           77257 non-null  object 
dtypes: float64(2), object(14)
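The `Non-Null Count` column of `df.info()` can also be obtained as data. A small sketch on a hypothetical `toy` frame (not the real dataset) that reads the dtypes and counts missing values per column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing values.
toy = pd.DataFrame({
    'name': ['Taco Bell', None, 'Pizza Hut'],
    'latitude': [36.1, np.nan, np.nan],
})

dtypes = toy.dtypes          # one dtype per column
missing = toy.isna().sum()   # number of missing values per column

print(dtypes)
print(missing)
```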

### Task 3. 

Apply the `.describe()` method to the dataset.

In [5]:
df.describe()

| | latitude | longitude |
|---|---|---|
| count | 55636.0 | 55636.0 |
| mean | 36.694846 | -98.713309 |
| std | 4.835124 | 18.245857 |
| min | -31.986438 | -159.49269 |
| 25% | 33.668355 | -117.64715 |
| 50% | 36.047195 | -96.68232 |
| 75% | 40.58838 | -82.67993 |
| max | 61.21946 | 115.903696 |


### Task 4. 

Find the mean values of the `latitude` and `longitude` columns. Round the results to two decimal places.

In [6]:
round(df['latitude'].mean(), 2)

36.69

In [7]:
round(df['longitude'].mean(), 2)

-98.71
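Both means can also be computed in one expression by selecting the two columns first. A minimal sketch on hypothetical toy values (the numbers below are made up for illustration):

```python
import pandas as pd

# Hypothetical coordinates, not taken from the dataset.
toy = pd.DataFrame({
    'latitude': [10.0, 20.0, 30.125],
    'longitude': [-100.0, -90.0, -80.5],
})

# mean() on a DataFrame returns a Series with one value per column;
# round(2) then rounds every value to two decimal places.
means = toy[['latitude', 'longitude']].mean().round(2)
print(means)
```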

### Task 5. 

Remove rows with missing values from the dataset. What is the shape of the dataset after the removal?

In [8]:
df.dropna().shape

(1925, 16)
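By default `dropna()` drops a row if any column is missing, which is why so few rows survive. The sketch below, on a hypothetical `toy` frame, contrasts that with restricting the check to selected columns through the `subset` parameter:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: only row 'A' is complete in every column.
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'latitude': [1.0, np.nan, 3.0, 4.0],
    'cuisines': ['Mexican', 'Pizza', None, None],
})

# Drop a row if ANY column is missing.
all_complete = toy.dropna()

# Drop a row only if `latitude` is missing.
with_coords = toy.dropna(subset=['latitude'])

print(all_complete.shape, with_coords.shape)
```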

### Task 6. 

Filter the data so that only places whose `city` is `California` remain. List the indexes of these places.

In [9]:
df[df['city'] == 'California'].index

Int64Index([40483, 52930, 52931, 52932, 52933, 52934, 52935, 52936, 52937,
            52938, 52939, 65070],
           dtype='int64')

### Task 7. 

Select only `Taco Bell` locations whose `city` is `California`. List the indexes of these places.

In [10]:
df[(df['city'] == 'California') & (df['name'] == 'Taco Bell')].index

Int64Index([52930, 52931, 52932, 52933, 52934, 52935, 52936, 52937, 52938,
            52939],
           dtype='int64')
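Each comparison in the mask above is wrapped in parentheses because `&` binds more tightly than `==` in Python. As an alternative sketch, `DataFrame.query` expresses the same filter as a string; the `toy` frame here is hypothetical:

```python
import pandas as pd

# Hypothetical toy frame with the two columns used in the filter.
toy = pd.DataFrame({
    'city': ['California', 'California', 'Austin'],
    'name': ['Taco Bell', 'Subway', 'Taco Bell'],
})

# Boolean-mask version, as in the notebook.
by_mask = toy[(toy['city'] == 'California') & (toy['name'] == 'Taco Bell')]

# Equivalent query-string version.
by_query = toy.query("city == 'California' and name == 'Taco Bell'")

print(by_query.index.tolist())
```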

### Task 8. 

Find places that are either `Taco Bell` or located in `New York`, excluding rows whose menu name is `Volcano Taco` or `Fresco Soft Taco`. Save the output to a new DataFrame named `result`.

In [11]:
data = df.copy(deep=True)

In [12]:
result = data[
    (
        (data['name'] == 'Taco Bell') | (data['city'] == 'New York')
    )
    &
    (
        (data['menus.name'] != 'Volcano Taco') & (data['menus.name'] != 'Fresco Soft Taco')
    )
]

In [13]:
result.shape

(3756, 16)
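When several values have to be excluded, chaining `!=` comparisons gets long. A hedged sketch of the same logic with `Series.isin` and negation, run on a hypothetical `toy` frame:

```python
import pandas as pd

# Hypothetical toy rows covering each combination of conditions.
toy = pd.DataFrame({
    'name': ['Taco Bell', 'Taco Bell', "Joe's", "Joe's", "Joe's"],
    'city': ['Austin', 'Austin', 'New York', 'New York', 'Austin'],
    'menus.name': ['Volcano Taco', 'Burrito', 'Fresco Soft Taco', 'Pizza', 'Pizza'],
})

excluded = ['Volcano Taco', 'Fresco Soft Taco']

# Keep Taco Bell rows or New York rows, then drop the excluded menu names.
place_mask = (toy['name'] == 'Taco Bell') | (toy['city'] == 'New York')
menu_mask = ~toy['menus.name'].isin(excluded)
filtered = toy[place_mask & menu_mask]

print(filtered.index.tolist())
```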

### Task 9. 

Filter data where `menus.currency` is not NaN. 

In [14]:
mask = data['menus.currency'].isna()

In [15]:
result = data[~mask]

In [16]:
result.shape

(40511, 16)

### Task 10. 

Access the `categories` column as a Series and as a DataFrame.

In [17]:
data['categories']  # Series

0        Restaurant Delivery Service,Restaurants,Pizza,...
1        Restaurant Delivery Service,Restaurants,Pizza,...
2             Golf Course, American Restaurant, and Resort
3                                     Fast Food Restaurant
4        Mexican Restaurant Mid-City West,Mexican Resta...
                               ...                        
77255                                           Restaurant
77256                                           Restaurant
77257                                           Restaurant
77258                                           Restaurant
77259                                           Restaurant
Name: categories, Length: 77260, dtype: object

In [18]:
data[['categories']]  # DataFrame

| | categories |
|---|---|
| 0 | Restaurant Delivery Service,Restaurants,Pizza,... |
| 1 | Restaurant Delivery Service,Restaurants,Pizza,... |
| 2 | Golf Course, American Restaurant, and Resort |
| 3 | Fast Food Restaurant |
| 4 | Mexican Restaurant Mid-City West,Mexican Resta... |
| ... | ... |
| 77255 | Restaurant |
| 77256 | Restaurant |
| 77257 | Restaurant |
| 77258 | Restaurant |


### Task 11. 

List the top 5 cities with the most records in the dataset.

In [19]:
data.groupby('city').count().sort_values('id', ascending=False).head().index

Index(['San Diego', 'Los Angeles', 'Chicago', 'San Francisco', 'New York'], dtype='object', name='city')
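`value_counts()` is a shorter route to the same kind of ranking, since it counts and sorts in descending order in one step. A minimal sketch on a hypothetical `toy` frame; note that it counts non-null `city` values, whereas the `groupby` version above sorts by non-null `id` counts:

```python
import pandas as pd

# Hypothetical toy data: city 'A' appears 3 times, 'B' twice, 'C' once.
toy = pd.DataFrame({'city': ['A', 'B', 'A', 'C', 'A', 'B']})

# value_counts sorts by frequency, so head(2) gives the top 2 cities.
top = toy['city'].value_counts().head(2)

print(top.index.tolist())
```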

### Task 12. 

Find how many `Taco Bell` restaurants there are in each city. List the top 5 cities.

In [20]:
result = data[data['name'] == 'Taco Bell']['city'].value_counts().head(5)

In [21]:
result

Indianapolis        84
Columbus            63
Charleston          63
Tampa               62
South Lake Tahoe    42
Name: city, dtype: int64

### Task 13. 

Find restaurants that were opened in October. List the indexes of the first 5 records.

In [22]:
data['dateAdded'] = pd.to_datetime(data['dateAdded'])

In [23]:
data[data['dateAdded'].dt.month == 10].head().index

Int64Index([2, 21, 22, 23, 24], dtype='int64')

### Task 14. 

Find out how many restaurants were opened in each month.

In [24]:
result = data.groupby(data['dateAdded'].dt.month).agg({'id': 'nunique'})

In [25]:
result

| dateAdded | id |
|---|---|
| 1 | 308 |
| 2 | 257 |
| 3 | 4970 |
| 4 | 3224 |
| 5 | 1141 |
| 6 | 1356 |
| 7 | 645 |
| 8 | 479 |
| 9 | 554 |
| 10 | 4716 |
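The aggregation above uses `nunique` on `id`, so an establishment that appears in several rows is counted once per month. The sketch below, on a hypothetical `toy` frame with a repeated `id`, shows how that differs from counting rows; the output column names `unique_ids` and `rows` are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data: id 'a' appears twice in March.
toy = pd.DataFrame({
    'id': ['a', 'a', 'b', 'c'],
    'dateAdded': pd.to_datetime([
        '2016-03-02T11:49:34Z',
        '2016-03-02T11:49:34Z',
        '2016-03-15T08:00:00Z',
        '2016-10-14T01:58:25Z',
    ]),
})

# Named aggregation: unique ids versus plain row counts per month.
per_month = toy.groupby(toy['dateAdded'].dt.month).agg(
    unique_ids=('id', 'nunique'),
    rows=('id', 'size'),
)

print(per_month)
```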


### Task 15. 

Find the mean `update_delta` and the maximum `latitude` for each city, where `update_delta` is the number of days between `dateAdded` and `dateUpdated`.

In [26]:
data['dateUpdated'] = pd.to_datetime(data['dateUpdated'])

In [27]:
data['update_delta'] = (data['dateUpdated'] - data['dateAdded']).dt.days

In [28]:
result = data.groupby('city', as_index=False).agg({'update_delta': 'mean', 'latitude': 'max'})
result

| | city | update_delta | latitude |
|---|---|---|---|
| 0 | Abbeville | 114.857143 | 29.982108 |
| 1 | Aberdeen | 81.625000 | 46.975110 |
| 2 | Abilene | 206.454545 | 32.453090 |
| 3 | Abingdon | 303.500000 | 36.712800 |
| 4 | Abington | 393.000000 | 40.124851 |
| ... | ... | ... | ... |
| 3596 | Zebulon | 0.000000 | 33.102090 |
| 3597 | Zephyr Cove | 105.500000 | 38.984947 |
| 3598 | Zephyrhills | 329.695652 | 28.271183 |
| 3599 | Zieglerville | 621.000000 | 40.291611 |
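The whole step, from computing the day difference to the per-city aggregation, can be sketched end to end on a hypothetical `toy` frame. This version uses named aggregation so the output columns say which statistic they hold; the names `mean_update_delta` and `max_latitude` are assumptions, not part of the task:

```python
import pandas as pd

# Hypothetical toy data with timezone-aware dates, as in the dataset.
toy = pd.DataFrame({
    'city': ['A', 'A', 'B'],
    'dateAdded': pd.to_datetime(['2016-01-01', '2016-01-01', '2016-03-01'], utc=True),
    'dateUpdated': pd.to_datetime(['2016-01-11', '2016-01-31', '2016-03-01'], utc=True),
    'latitude': [10.0, 12.5, 40.0],
})

# Subtracting datetimes gives timedeltas; .dt.days keeps the whole days.
toy['update_delta'] = (toy['dateUpdated'] - toy['dateAdded']).dt.days

summary = toy.groupby('city', as_index=False).agg(
    mean_update_delta=('update_delta', 'mean'),
    max_latitude=('latitude', 'max'),
)

print(summary)
```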


### Task 16. 

Find the mean `update_delta` for the city of `Zephyrhills`.

In [29]:
zep_mean = data[data['city'] == 'Zephyrhills']['update_delta'].mean()
zep_mean

329.69565217391306

### Task 17. 

Find the index of the third record whose `categories` column contains the word `Pizza`.

In [30]:
mask = data['categories'].str.contains('Pizza')
data[mask].reset_index().iloc[2]

index                                                               66
id                                                AVwc_59U_7pvs4fz1Md3
address                                               3250 Kennedy Cir
categories           Pizza,Take Out Restaurants,Caterers,Restaurant...
city                                                           Dubuque
cuisines                                                         Pizza
dateAdded                                    2016-03-29 04:45:06+00:00
dateUpdated                                  2016-05-05 12:40:51+00:00
latitude                                                     42.505264
longitude                                                    -90.72141
menus.category                                                     NaN
menus.currency                                                     NaN
menus.dateSeen               2016-05-05T12:40:51Z,2016-03-29T04:45:06Z
menus.description                                                  NaN
menus.
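The cell above returns the whole third matching row, with the original label in its `index` field. If only the label is needed, it can be read from the filtered index directly. A minimal sketch on a hypothetical `toy` frame; `na=False` is an extra precaution that treats missing categories as non-matches (the real `categories` column has no missing values, so it is not required there):

```python
import pandas as pd

# Hypothetical toy data with non-default index labels and one missing value.
toy = pd.DataFrame(
    {'categories': ['Pizza,Italian', 'Burgers', None, 'Pizza', 'Fast Food,Pizza Place']},
    index=[10, 20, 30, 40, 50],
)

# True where the text contains 'Pizza'; missing values count as False.
mask = toy['categories'].str.contains('Pizza', na=False)

# Labels of matching rows, then the third one (position 2).
third_idx = toy[mask].index[2]

print(third_idx)
```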

### Task 18. 

Find the mean and median values of the `update_delta` column.

In [31]:
round(data['update_delta'].mean(), 2)

326.22

In [32]:
round(data['update_delta'].median(), 2)

364.0

### Task 19. 

Keep only places that have more than 20 items in the `categories` column. Then group the data by province and find the minimum longitude for each province. Round the aggregated values to 3 decimal places and save the result to a `csv` file. The file should have only two columns: `province` and `longitude`.

In [33]:
def category_len_cnt(category):
    # Count the comma-separated items in a categories string.
    return len(category.split(','))

In [34]:
data['category_len'] = data['categories'].apply(category_len_cnt)

In [35]:
result = data[data['category_len'] > 20].groupby('province', as_index=False).agg({'longitude': 'min'})

In [36]:
result['longitude'] = result['longitude'].round(3)

In [37]:
result.to_csv('result.csv', sep=';', index=False)
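The full pipeline can be checked on a small example before running it on the real data. The sketch below uses a hypothetical `toy` frame, a threshold of 2 items instead of 20, and an in-memory buffer instead of `result.csv`; it also counts items with the vectorized `.str.split(',').str.len()` rather than `apply`, which is a swapped-in alternative to the helper function above:

```python
import io

import pandas as pd

# Hypothetical toy data; the real task uses a threshold of 20 items.
toy = pd.DataFrame({
    'categories': ['a,b,c', 'a,b', 'a,b,c,d'],
    'province': ['CA', 'CA', 'NY'],
    'longitude': [-117.64715, -120.0, -73.98765],
})

# Number of comma-separated items per row.
toy['category_len'] = toy['categories'].str.split(',').str.len()

out = (
    toy[toy['category_len'] > 2]
    .groupby('province', as_index=False)
    .agg({'longitude': 'min'})
)
out['longitude'] = out['longitude'].round(3)

# Write with ';' as separator, then read back with the same separator.
buffer = io.StringIO()
out.to_csv(buffer, sep=';', index=False)
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()), sep=';')

print(round_trip)
```

Because the file is written with `sep=';'`, the same separator has to be passed to `read_csv` when loading it again.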