# Step 1. Download the data and prepare it for analysis

## Project description
We decided to open a small robot-run cafe in Los Angeles. The project is promising but expensive, so we decided to try to attract investors. They're interested in the current market conditions: will we be able to maintain our success when the novelty of robot waiters wears off?
We have been asked to prepare some market research. We have open-source data on restaurants in LA.

## Import

In [79]:
!pip install usaddress

Collecting usaddress
  Downloading usaddress-0.5.10-py2.py3-none-any.whl (63 kB)
Collecting python-crfsuite>=0.7
  Downloading python_crfsuite-0.9.8-cp39-cp39-win_amd64.whl (158 kB)
Collecting probableparsing
  Downloading probableparsing-0.0.1-py2.py3-none-any.whl (3.1 kB)
Installing collected packages: python-crfsuite, probableparsing, usaddress
Successfully installed probableparsing-0.0.1 python-crfsuite-0.9.8 usaddress-0.5.10


In [80]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from scipy import stats as st
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
import usaddress

%matplotlib inline

## Load data

In [23]:
try:
    rest_data = pd.read_csv('rest_data_us.csv', sep=',')
except FileNotFoundError:
    # Fall back to the alternative dataset location
    rest_data = pd.read_csv('/datasets/rest_data_us.csv', sep=',')

## Prepare data for analysis

- object_name — establishment name
- chain — chain establishment (TRUE/FALSE)
- object_type — establishment type
- address — address
- number — number of seats

In [24]:
rest_data.head()

Unnamed: 0,id,object_name,address,chain,object_type,number
0,11786,HABITAT COFFEE SHOP,3708 N EAGLE ROCK BLVD,False,Cafe,26
1,11787,REILLY'S,100 WORLD WAY # 120,False,Restaurant,9
2,11788,STREET CHURROS,6801 HOLLYWOOD BLVD # 253,False,Fast Food,20
3,11789,TRINITI ECHO PARK,1814 W SUNSET BLVD,False,Restaurant,22
4,11790,POLLEN,2100 ECHO PARK AVE,False,Restaurant,20


In [25]:
rest_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9651 entries, 0 to 9650
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           9651 non-null   int64 
 1   object_name  9651 non-null   object
 2   address      9651 non-null   object
 3   chain        9648 non-null   object
 4   object_type  9651 non-null   object
 5   number       9651 non-null   int64 
dtypes: int64(2), object(4)
memory usage: 452.5+ KB


Check if id numbers are unique

In [26]:
rest_data['id'].nunique()

9651

Yes, the id numbers are unique: 9651 distinct values for 9651 rows.

Let's see how the chain values are distributed

In [27]:
rest_data['chain'].value_counts()

False    5972
True     3676
Name: chain, dtype: int64

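The values look fine, but they sum to 9648 rather than 9651, so 3 rows have a missing chain value (as the info() output also shows).

Let's see if the same establishment appears more than once at the same address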
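The missing chain values can be counted and handled explicitly. The following is a minimal sketch on a toy frame (the data and the choice to drop the incomplete rows are assumptions, not part of the original analysis):

```python
import pandas as pd

# Toy frame standing in for rest_data; the values are illustrative only
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "chain": [True, False, None, True],
})

# value_counts() skips missing values by default, so count them separately
n_missing = int(toy["chain"].isna().sum())
counts_with_nan = toy["chain"].value_counts(dropna=False)

# One possible treatment (an assumption): drop rows with unknown chain status
toy_clean = toy.dropna(subset=["chain"]).copy()
toy_clean["chain"] = toy_clean["chain"].astype(bool)

print(n_missing)
print(counts_with_nan)
```

Casting the column to bool after dropping the gaps also makes later aggregations such as a mean of chain straightforward.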

In [28]:
rest_data.groupby(['object_name', 'address'])['id'].count()

object_name                address                    
#1 CAFE                    2080 CENTURY PARK E STE 108    1
#1 CHINESE FAST FOOD       8606 S VERMONT AVE             1
#1 DONUT                   8509 S FIGUEROA ST             1
#1 DONUTS                  8509 S FIGUEROA ST # 106       1
#2 MOON BBQ                478 N WESTERN AVE              1
                                                         ..
ZULY'S 99 AND UP DISCOUNT  3326 S CENTRAL AVE             1
ZUMA KITCHEN               1942 WESTWOOD BLVD             1
ZWEET CAFE                 4682 EAGLE ROCK BLVD           1
ZWONNY KITCHEN INC         857 S WESTERN AVE              1
ZZAMONG                    4255 W 3RD ST                  1
Name: id, Length: 9651, dtype: int64

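Every (name, address) pair appears exactly once: the grouped result has 9651 rows, the same as the dataset. Near-duplicates can still exist, such as #1 DONUT and #1 DONUTS at 8509 S FIGUEROA ST.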
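A duplicate check can also be written directly with `DataFrame.duplicated`. This sketch uses a toy frame (the repeated ZUMA KITCHEN row is invented for illustration) and, as an assumed looser rule, compares addresses after cutting off any unit suffix that follows `#`:

```python
import pandas as pd

# Toy frame; the duplicated ZUMA KITCHEN row is made up for the example
toy = pd.DataFrame({
    "object_name": ["#1 DONUT", "#1 DONUTS", "ZUMA KITCHEN", "ZUMA KITCHEN"],
    "address": [
        "8509 S FIGUEROA ST",
        "8509 S FIGUEROA ST # 106",
        "1942 WESTWOOD BLVD",
        "1942 WESTWOOD BLVD",
    ],
})

# Exact duplicates of the (name, address) pair
exact_dupes = int(toy.duplicated(subset=["object_name", "address"]).sum())

# Looser check (an assumption): ignore the unit suffix after '#'
base_address = toy["address"].str.split("#").str[0].str.strip()
shared_address = int(base_address.duplicated(keep=False).sum())

print(exact_dupes, shared_address)
```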

# Step 2. Data analysis

## Investigate the proportions of the various types of establishments. Plot a graph.


We'll plot a pie chart to see the proportions

In [38]:
df = rest_data.groupby('object_type')['id'].count().reset_index()
fig = px.pie(df, names='object_type', values='id', title='Proportions of various types of establishments')
fig.show()

Restaurants are the most common type of establishment

## Investigate the proportions of chain and nonchain establishments. Plot a graph.


We'll plot a pie chart to see the proportions

In [39]:
df = rest_data.groupby('chain')['id'].count().reset_index()
fig = px.pie(df, names='chain', values='id', title='Proportions of chain and nonchain establishments')
fig.show()

Most establishments are not part of a chain: 5972 of the 9648 with a known status, about 62%

## Which type of establishment is typically a chain?


In [51]:
df = rest_data.pivot_table(
    index=['object_type', 'chain'],
    values='id',
    aggfunc='count'
).reset_index()
df

Unnamed: 0,object_type,chain,id
0,Bakery,True,283
1,Bar,False,215
2,Bar,True,77
3,Cafe,False,169
4,Cafe,True,266
5,Fast Food,False,461
6,Fast Food,True,605
7,Pizza,False,166
8,Pizza,True,153
9,Restaurant,False,4961


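Bakeries are **always** chains in this data (283 of 283). Cafes (266 of 435) and fast food places (605 of 1066) are also mostly chains, while bars (77 of 292) are mostly independent.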
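The share of chains per establishment type can be computed in one step instead of being read off the counts. A minimal sketch on toy data (the rows are illustrative; in the real data the chain column would first need its missing values handled and a cast to bool):

```python
import pandas as pd

# Toy data; assumes chain is already a clean boolean column
toy = pd.DataFrame({
    "object_type": ["Bakery", "Bakery", "Cafe", "Cafe", "Cafe", "Bar"],
    "chain": [True, True, True, False, True, False],
})

# The mean of a boolean column is the fraction of True values
chain_share = (
    toy.groupby("object_type")["chain"]
    .mean()
    .sort_values(ascending=False)
)

print(chain_share)
```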

## What characterizes chains: many establishments with a small number of seats or a few establishments with a lot of seats?

In [68]:
df = rest_data.query('chain == True').groupby(
    'object_name').agg({'id':'count', 'number':'mean'}).reset_index()
df.columns = ['object_name', 'number_of_restaurants', 'average_number_of_seats']
fig = px.scatter(df, x="number_of_restaurants", y="average_number_of_seats")
fig.update_layout(
    title="Chains by number of establishments and average number of seats"
)
fig.show()



Chains are characterized by a few establishments with a lot of seats rather than by many establishments with a small number of seats
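A scatter plot is read by eye, so a numeric summary can help back up the reading. The sketch below builds the same per-chain table on toy data and computes a Pearson correlation between the number of locations and the average seat count (the toy values and the choice of correlation as the summary are assumptions):

```python
import pandas as pd

# Toy chain data: one row per establishment
toy = pd.DataFrame({
    "object_name": ["A", "A", "A", "B", "C", "C"],
    "number": [10, 20, 30, 100, 40, 60],
})

# One row per chain: how many locations and how many seats on average
per_chain = toy.groupby("object_name").agg(
    number_of_restaurants=("number", "size"),
    average_number_of_seats=("number", "mean"),
)

# A negative value means chains with more locations tend to have fewer seats
corr = per_chain["number_of_restaurants"].corr(
    per_chain["average_number_of_seats"]
)

print(per_chain)
print(corr)
```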

## Determine the average number of seats for each type of restaurant. On average, which type of restaurant has the greatest number of seats? Plot graphs.

In [76]:
df = rest_data.groupby('object_type')['number'].mean().reset_index().sort_values(by='number')
fig = px.bar(df, x='object_type', y='number')
fig.update_layout(
    title="Number of seats for each type of restaurant",
    xaxis_title="type of restaurant",
    yaxis_title="average number of seats"   
)
fig.show()
df

Unnamed: 0,object_type,number
0,Bakery,21.773852
2,Cafe,25.0
4,Pizza,28.459375
3,Fast Food,31.837711
1,Bar,44.767123
5,Restaurant,48.042316


Restaurants have the greatest average number of seats (about 48), closely followed by bars (about 45)

## Put the data on street names from the address column in a separate column.

Use a function built on usaddress to keep only the address number and the street name, dropping the direction, street type and unit

In [83]:
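# Final version of the address-cleaning function
def cleaning_final(raw):
    """Return '<address number> <street name>, Los Angeles,USA' for a raw address."""
    # Two addresses are special-cased by hand
    if raw.startswith('OLVERA'):
        clean_address = 'OLVERA,Los Angeles,USA'
    elif raw.startswith('1033 1/2 LOS ANGELES'):
        clean_address = '1033 1/2 LOS ANGELES ST,Los Angeles,USA'
    else:
        # usaddress.parse returns (token, label) pairs; only the last token per label
        # is kept, so a multi-word street name keeps only its last word
        # (for example, BEVERLY GLEN becomes GLEN)
        dict_address = {label: token for token, label in usaddress.parse(raw)}
        clean_address = dict_address['AddressNumber'] + ' ' + dict_address['StreetName'] + ', Los Angeles,USA'
    return clean_address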

In [84]:
rest_data['clean_street_final']=rest_data.address.apply(cleaning_final)
rest_data.sample(10)

Unnamed: 0,id,object_name,address,chain,object_type,number,clean_street_final
736,12522,"SPECIALTY'S CAFE & BAKERY, INC",400 S HOPE ST STE 110,True,Bakery,90,"400 HOPE, Los Angeles,USA"
3166,14952,HAPPY TUESDAY,1883 DALY ST #102,False,Restaurant,5,"1883 DALY, Los Angeles,USA"
75,11861,VNS CHICKEN,3250 W OLYMPIC BLVD STE 105,False,Restaurant,29,"3250 OLYMPIC, Los Angeles,USA"
9140,20926,HOLY JUICE,2926 N BEVERLY GLEN CIR,False,Restaurant,9,"2926 GLEN, Los Angeles,USA"
1492,13278,YUPDDUK,3132 W OLYMPIC BLVD,False,Restaurant,16,"3132 OLYMPIC, Los Angeles,USA"
5635,17421,ROSCOES CHICKEN & WAFFLES,5006 W PICO BLVD,False,Restaurant,83,"5006 PICO, Los Angeles,USA"
7990,19776,THAI ORIGINAL BBQ,4055 W 3RD ST,False,Restaurant,92,"4055 3RD, Los Angeles,USA"
2164,13950,LE COMPTOIR,3606 W 6TH ST,False,Restaurant,9,"3606 6TH, Los Angeles,USA"
5104,16890,LA FAMOSITA,5116 S CENTRAL AVE,False,Restaurant,28,"5116 CENTRAL, Los Angeles,USA"
3677,15463,LOBBY CAFE BAR,1000 WILSHIRE BLVD,False,Bar,48,"1000 WILSHIRE, Los Angeles,USA"


Worked!
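The column above keeps the address number, so values are addresses rather than streets. If only the street name is wanted, one alternative is a small rule-based parser. This is a sketch that swaps usaddress for plain string handling; the function name `street_name_only` and the lists of directions, street types and unit markers are assumptions and far from complete:

```python
import re

# Hand-written token lists (assumptions, not exhaustive)
DIRECTIONS = {"N", "S", "E", "W"}
STREET_TYPES = {"ST", "AVE", "BLVD", "DR", "RD", "WAY", "PL", "CIR", "LN"}
UNIT_MARKERS = {"STE", "UNIT", "APT"}


def street_name_only(raw):
    """Return just the street name from a raw address string."""
    # Cut off a unit suffix such as '# 120' or '#102'
    tokens = raw.split("#")[0].upper().split()
    # Cut off everything from the first unit marker (e.g. 'STE 110')
    for i, tok in enumerate(tokens):
        if tok in UNIT_MARKERS:
            tokens = tokens[:i]
            break
    # Drop leading house-number tokens, including fractions like 1/2
    while tokens and re.fullmatch(r"\d+(/\d+)?", tokens[0]):
        tokens.pop(0)
    # Drop a leading direction and a trailing street type
    if len(tokens) > 1 and tokens[0] in DIRECTIONS:
        tokens.pop(0)
    if len(tokens) > 1 and tokens[-1] in STREET_TYPES:
        tokens.pop()
    return " ".join(tokens)


print(street_name_only("2926 N BEVERLY GLEN CIR"))
```

Grouping by a street-only column like this would give different counts from the ones reported in this notebook, which were produced with `cleaning_final`.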

## Plot a graph of the top ten streets by number of restaurants.

In [97]:
df = rest_data.groupby(
    'clean_street_final')['id'].count().reset_index().sort_values(by='id', ascending=False).head(10)
fig = px.bar(df, x='clean_street_final', y='id')
fig.update_layout(
    title="Top ten streets by number of restaurants",
    xaxis_title="street name",
    yaxis_title="number of restaurants"   
)
fig.show()

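In the lead are 6333 3RD and 10250 MONICA, with 63 restaurants. Because clean_street_final keeps the address number, each bar is a single address rather than a whole street.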

## Find the number of streets that only have one restaurant.

In [96]:
number_of_streets = rest_data.groupby(
    'clean_street_final'
)['id'].count().reset_index().sort_values(by='id').query('id == 1').shape[0]
f'There are {number_of_streets} streets with only one restaurant.'

'There are 5086 streets with only one restaurant.'

## For streets with a lot of restaurants, look at the distribution of the number of seats. What trends can you see?

In [107]:
# Keep only the data for the 10 busiest streets
busy_streets = rest_data.groupby(
    'clean_street_final')['id'].count().reset_index().sort_values(by='id', ascending=False).head(10)
# get array of the busy streets
array = busy_streets['clean_street_final'].to_list()
# filter df by these streets 
df = rest_data.loc[rest_data['clean_street_final'].isin(array)]
fig = px.histogram(df, x="number")
fig.update_layout(
    title="Distribution of the number of seats")
fig.show()


Most places on the busiest streets have a small number of seats. Small venues dominate there.
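The histogram reading can be supported with a couple of summary numbers. A minimal sketch on toy data (the values, the use of the top 2 rather than top 10 streets, and the 50-seat cutoff for "small" are all assumptions):

```python
import pandas as pd

# Toy data standing in for rest_data with the cleaned street column
toy = pd.DataFrame({
    "clean_street_final": ["A", "A", "A", "B", "B", "C"],
    "number": [10, 25, 150, 30, 45, 200],
})

# Streets with the most establishments (top 2 here, top 10 in the notebook)
top_streets = toy["clean_street_final"].value_counts().head(2).index
busy = toy[toy["clean_street_final"].isin(top_streets)]

# Typical size and the share of venues below the assumed 50-seat cutoff
median_seats = busy["number"].median()
share_under_50 = (busy["number"] < 50).mean()

print(median_seats, share_under_50)
```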

## Draw an overall conclusion and provide recommendations on restaurant type and number of seats. Comment on the possibility of developing a chain.

The data contains no direct measure of success for the establishments. We only know a few characteristics: location, number of seats, chain membership and establishment type. We therefore start from the premise that the more common a type is, the more likely it is to be successful; otherwise there would be few establishments of that type. On that basis we recommend opening an establishment of type Restaurant with 45 to 50 seats. Since most establishments of type Restaurant are not part of a chain, we do not recommend developing a chain.

# Step 3. Preparing a presentation

Presentation: <https://drive.google.com/file/d/1_f2R1kbs2IUcsGmSahZlrs1iEfNcED5e/view?usp=sharing>