
# Homework Assignment 1
### [The Art of Analyzing Big Data - The Data Scientist’s Toolbox](https://www.ise.bgu.ac.il/labs/fire/lectures.html)
#### By Dr. Michael Fire 

For this homework you will need to write code that analyzes real-world datasets. The code needs to be written in Python using the [sqlite3](https://docs.python.org/2/library/sqlite3.html) package. 

**Please note:** You need to answer only the questions that match your ID first digit.

# 1. Babies Names Dataset (35pt)

**Task 1 (for everyone):** Write code that loads the [**babies names dataset**](https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-data-by-state-and-district-of-#topic=developers_navigation) into a table named Names with the following columns: 'State', 'Gender', 'Name', 'Number', and 'Year' (5pt)
**Bonus:** Load the data using a batch INSERT SQL query (2pt)

In [0]:
# restart state
! rm -rf ./datasets

In [0]:
# Creating a dataset directory
!mkdir ./datasets
!mkdir ./datasets/us-baby-name
# download the dataset using wget
!wget --directory-prefix ./datasets/us-baby-name https://www.ssa.gov/oact/babynames/state/namesbystate.zip
!unzip ./datasets/us-baby-name/*.zip  -d ./datasets/us-baby-name/namesbystate
# concatenate to one file (overwrite, so re-running the cell does not duplicate rows)
!cat ./datasets/us-baby-name/namesbystate/*.TXT > ./datasets/us-baby-name/namesbystate.txt

mkdir: cannot create directory ‘./datasets’: File exists
--2020-03-25 10:05:39--  https://www.ssa.gov/oact/babynames/state/namesbystate.zip
Resolving www.ssa.gov (www.ssa.gov)... 137.200.4.16, 2001:1930:d07::aaaa
Connecting to www.ssa.gov (www.ssa.gov)|137.200.4.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21974087 (21M) [application/zip]
Saving to: ‘./datasets/us-baby-name/namesbystate.zip’


2020-03-25 10:06:48 (318 KB/s) - ‘./datasets/us-baby-name/namesbystate.zip’ saved [21974087/21974087]

Archive:  ./datasets/us-baby-name/namesbystate.zip
  inflating: ./datasets/us-baby-name/namesbystate/AK.TXT  
  inflating: ./datasets/us-baby-name/namesbystate/AL.TXT  
  inflating: ./datasets/us-baby-name/namesbystate/AR.TXT  
  inflating: ./datasets/us-baby-name/namesbystate/AZ.TXT  
  inflating: ./datasets/us-baby-name/namesbystate/CA.TXT  
  inflating: ./datasets/us-baby-name/namesbystate/CO.TXT  
  inflating: ./datasets/us-baby-name/namesbystate/CT.TXT  
  in

In [0]:
import sqlite3
import matplotlib
import matplotlib.pyplot as plt
import os
%matplotlib inline
TEXT_PATH = './datasets/us-baby-name/namesbystate.txt'
DB_PATH = './datasets/us-baby-name/namebystate.sqlite'

In [0]:
conn = sqlite3.connect(DB_PATH) # connecting to the database
c = conn.cursor() # creating a cursor object
# create Names table
c.execute(
        '''CREATE TABLE IF NOT EXISTS Names
             ([State] text,
              [Gender] text,
              [Year] integer,
              [Name] text,
              [Number] integer)
        '''
)

# load the data into a convenient format: one (State, Gender, Year, Name, Number) tuple per line
with open(TEXT_PATH) as f:
    names = [tuple(line.strip().split(",")) for line in f if line.strip()]

# insert data to db
c.executemany(
    '''INSERT INTO Names(State, Gender, Year, Name, Number)
       values (?,?,?,?,?)
    ''', names
)
# show all rows in Names
#c.execute("SELECT * FROM Names").fetchall()

<sqlite3.Cursor at 0x7f87c8c20d50>

**Task 2 (for everyone):** Write a query that returns the statistics for the name William (5pt). Use [the timeit package](https://docs.python.org/3/library/timeit.html) to measure the time it takes the query to run (5pt). **Bonus:** [Create an index](https://www.w3schools.com/sql/sql_create_index.asp) on the _Name_ column and use [the timeit package](https://docs.python.org/3/library/timeit.html) to measure the time it takes the query to run with the index (5pt)

In [0]:
import time
import timeit

def test():
    query = "SELECT COUNT(*) FROM Names WHERE Name='William'"
    print("There are {} williams".format(c.execute(query).fetchone()[0]))

c.execute("DROP INDEX IF EXISTS idx_name")
print("Time without index:{}".format(timeit.timeit("test()", globals=globals(), number=1)))

c.execute('''CREATE INDEX IF NOT EXISTS idx_name
             ON Names (name);''')
print("Time with index:{}".format(timeit.timeit("test()", globals=globals(), number=1)))

There are 6726 williams
Time without index:0.44438184000000547
There are 6726 williams
Time with index:0.0006709849999424478
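
To check that the speed-up really comes from the index, one option is to ask SQLite for its query plan before and after the index is created. The sketch below is a minimal illustration on a hypothetical in-memory table with the same columns as `Names`; the sample rows and the helper name `query_plan` are assumptions made up for this example, not part of the assignment.

```python
import sqlite3

# Hypothetical in-memory table with the same columns as Names (sample rows are made up)
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Names (State text, Gender text, Year integer, Name text, Number integer)")
demo.executemany(
    "INSERT INTO Names VALUES (?,?,?,?,?)",
    [("NY", "M", 1999, "William", 120), ("AK", "M", 1999, "William", 9), ("NY", "F", 1999, "Emma", 80)],
)

def query_plan(connection, sql, params=()):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)  # the detail text is the last column

sql = "SELECT COUNT(*) FROM Names WHERE Name = ?"
plan_before = query_plan(demo, sql, ("William",))   # expected to be a full table scan
demo.execute("CREATE INDEX IF NOT EXISTS idx_name ON Names (Name)")
plan_after = query_plan(demo, sql, ("William",))    # expected to mention idx_name
print(plan_before)
print(plan_after)
```

The exact wording of the plan differs between SQLite versions, so it is safer to look for the index name in the text than to compare whole strings.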


### <span style="color:red"> Please answer only **one** of the following questions according to your ID number (use the formula **<YOUR_ID> mod 4 +1**) </span>

In [0]:
# which question to answer - put your ID number and run the code 
your_id  = "316460443"
q = int(your_id) % 4 + 1
print("You need to answer question number %s" % q)

You need to answer question number 4


***Question 1:*** Write a function that returns how many babies were born in a given state in a given year.
Use it to calculate the number of babies born in LA in 1950 (10pt)

***Question 2:*** Write a function that returns how many male babies were born between a given range of years.
Use it to calculate how many male babies were born between 1970 and 1975 (10pt)

**Question 3:** Write a function that returns the most common female name in a given state. Use it to calculate the most common female name in Washington in 1987 (10pt)

**Question 4:** Write a function that returns how many male babies named _William_ were born in a given state in a given year. Use it to find the state in which the highest number of babies named _William_ were born in 1999 (10pt)

In [0]:
def find_babies(year, state, gender='M', name='William'):
    # SUM(Number) adds up the babies; COUNT(*) would only count the matching rows
    num = c.execute(
    '''SELECT COALESCE(SUM(Number), 0)
        FROM Names
        WHERE State=? AND
              Gender=? AND
              Name=? AND
              Year=?
    ''', (state, gender, name, int(year))
    ).fetchone()[0]
    print(f'Number of babies with name={name}, year={year}, gender={gender} is {num} in state={state}')
    return num

def find_babies_max(year, gender='M', name='William'):
    # GROUP BY State gives one total per state; the largest total comes first
    num, state = c.execute(
    '''SELECT SUM(Number), State
        FROM Names
        WHERE Gender=? AND
              Name=? AND
              Year=?
        GROUP BY State
        ORDER BY 1 DESC
        LIMIT 1
    ''', (gender, name, int(year))
    ).fetchone()
    print(f'Max number of babies with name={name}, year={year}, gender={gender} is {num} in state={state}')
    return state, num


find_babies(year=1999, state='AK')
# implemented more efficiently using a single query
# could also be implemented with a for loop over the states
find_babies_max(year=1999)


**Question (for everyone):** For the state of NY write code that calculates the second most popular female/male names in each decade (10pt). **Bonus**: Visualize it somehow using Matplotlib (5pt)

In [0]:
def find_popular_in_decade(decade, gender, state='NY'):
    # SUM(Number) ranks names by babies born; COUNT(*) would only count rows
    row = c.execute(
    '''SELECT SUBSTR(Year, 1, 3) || '0', Gender, Name, SUM(Number)
        FROM Names
        WHERE State=? AND
              Gender=? AND
              (SUBSTR(Year, 1, 3) || '0')=?
        GROUP BY SUBSTR(Year, 1, 3), Name
        ORDER BY 4 DESC LIMIT 1 OFFSET 1
    ''', (state, gender, str(decade))
    ).fetchone() # OFFSET 1 skips the top name, leaving the second most popular
    print(row)
    return row


for decade in range(1910, 2010, 10):
    find_popular_in_decade(decade=decade, gender='M')
    find_popular_in_decade(decade=decade, gender='F')

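As an alternative to running one query per decade, the runner-up names for all decades can be collected in a single pass and then handed to Matplotlib (for example as bar heights). The following is only a sketch on a hypothetical in-memory copy of the `Names` table; the sample rows and the function name `second_most_popular` are invented for illustration.

```python
import sqlite3
from itertools import groupby

# Hypothetical sample rows in a table shaped like Names
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Names (State text, Gender text, Year integer, Name text, Number integer)")
demo.executemany("INSERT INTO Names VALUES (?,?,?,?,?)", [
    ("NY", "M", 1910, "John", 50), ("NY", "M", 1915, "John", 40),
    ("NY", "M", 1912, "William", 60), ("NY", "M", 1913, "Abe", 10),
    ("NY", "M", 1921, "Robert", 70), ("NY", "M", 1925, "John", 65),
    ("NY", "F", 1911, "Mary", 99),
])

def second_most_popular(connection, state, gender):
    """Map each decade to the (name, total) pair with the second-highest total."""
    rows = connection.execute(
        """SELECT (Year / 10) * 10 AS Decade, Name, SUM(Number) AS Total
           FROM Names
           WHERE State = ? AND Gender = ?
           GROUP BY Decade, Name
           ORDER BY Decade, Total DESC, Name""",
        (state, gender),
    ).fetchall()
    result = {}
    # rows arrive sorted by decade, so groupby yields one ranked group per decade
    for decade, group in groupby(rows, key=lambda r: r[0]):
        ranked = list(group)
        if len(ranked) >= 2:  # a decade with a single name has no runner-up
            result[decade] = ranked[1][1:]
    return result

runner_up = second_most_popular(demo, "NY", "M")
print(runner_up)
```

The dictionary keys (decades) and the totals can be used directly as the x and y values of a bar chart, with the names as tick labels.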

# 2. Flavors of Cacao Dataset (15pt)

Using the [Flavors of Cacao](https://www.kaggle.com/rombikuboktaeder/choco-flavors) dataset, answer the following questions:

**Question 1:** Write a function that returns the number of bars manufactured where the bars' BroadBean Origin is a given country. Use the function to calculate the number of bars where BroadBean Origin is 'Fiji' (15pt)

**Question 2:** Write a function that returns the maximal and average cocoa percentage in a bar manufactured by a company in a specific country. Use the function to calculate the maximal and average cocoa percentage in bars manufactured by a Swiss company (15pt).

**Question 3:** Calculate the second most common bean type(s) and the most rare bean type(s) (15pt)

**Question 4:** Calculate the number of reviews and the average rating in each year. Calculate the number of reviews and the average rating of each company in each year (15pt)
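
Question 2 needs numeric aggregates over the cocoa percentage. If that column is stored as text such as `'70%'` (an assumption about the CSV, not something stated in the assignment), the percent sign has to be stripped before `MAX` and `AVG` give meaningful results. A minimal sketch with a hypothetical table `bars` and made-up rows:

```python
import sqlite3

# Hypothetical table and rows; the real dataset's column names differ
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE bars (company text, location text, cocoa_percent text)")
demo.executemany("INSERT INTO bars VALUES (?,?,?)", [
    ("A", "Switzerland", "70%"), ("B", "Switzerland", "85%"),
    ("C", "Switzerland", "72.5%"), ("D", "France", "100%"),
])

def cocoa_stats(connection, location):
    """Return (max, average) cocoa percentage for companies in one location."""
    return connection.execute(
        """SELECT MAX(CAST(REPLACE(cocoa_percent, '%', '') AS REAL)),
                  AVG(CAST(REPLACE(cocoa_percent, '%', '') AS REAL))
           FROM bars
           WHERE location = ?""",
        (location,),
    ).fetchone()

max_pct, avg_pct = cocoa_stats(demo, "Switzerland")
print(max_pct, avg_pct)
```

Without the cast, SQLite would compare the values as text, so `'85%'` would rank above `'100%'`.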

In [2]:
!mkdir /root/.kaggle/
import json
import os

# Installing the Kaggle package
!pip install kaggle 

# Important note: fill in your own credentials - after running this for the first time, remember to **remove** your API key
api_token = {"username":"<YOUR_KAGGLE_USERNAME>","key":"<YOUR_KAGGLE_API_KEY>"}


# creating kaggle.json file with the personal API-Key details 
# You can also put this file on your Google Drive
with open('/root/.kaggle/kaggle.json', 'w') as file:
  json.dump(api_token, file)
!chmod 600 /root/.kaggle/kaggle.json

mkdir: cannot create directory ‘/root/.kaggle/’: File exists


In [0]:
# download and unzip dataset
!kaggle datasets list -s choco_flavors
!kaggle datasets download rombikuboktaeder/choco-flavors -p ./datasets/choco-flavors/
!unzip ./datasets/choco-flavors/choco-flavors.zip -d ./datasets/choco-flavors/

ref                             title          size  lastUpdated          downloadCount  
------------------------------  -------------  ----  -------------------  -------------  
rombikuboktaeder/choco-flavors  choco_flavors  30KB  2018-04-01 04:36:29            522  
Downloading choco-flavors.zip to ./datasets/choco-flavors
  0% 0.00/30.3k [00:00<?, ?B/s]
100% 30.3k/30.3k [00:00<00:00, 26.2MB/s]
Archive:  ./datasets/choco-flavors/choco-flavors.zip
  inflating: ./datasets/choco-flavors/flavors_of_cacao.csv  


In [3]:
!pip install pony

Collecting pony
  Downloading https://files.pythonhosted.org/packages/48/e4/45fa6185e86edfa97eef5a4fbe3f29f537de7f36c032ff7c54676310dcb1/pony-0.7.13.tar.gz (284kB)

In [0]:
from pony.orm import *
# Creating a new database
db = Database()
db.bind(provider='sqlite', filename='/content/datasets/choco-flavors/choco-flavors.pony.db', create_db=True)

class ChocoFlavor(db.Entity):
    Company = Optional(str)
    BeanOrigin = Required(str)
    REF = Required(int)
    ReviewDate = Required(int)
    CocoaPrecent = Required(str)
    Location = Required(str)
    Rating = Required(float)
    BeanType = Optional(str)
    BroadBeanOrigin = Optional(str)
     
#set_sql_debug(True) # helps to see what SQL commands are running
db.generate_mapping(create_tables=True) # create tables

In [0]:
! rm -rf '/content/datasets/choco-flavors/choco-flavors.pony.db'

In [0]:
import pandas
import numpy as np

# pandas can find csv inside the zip itself
df = pandas.read_csv("./datasets/choco-flavors/choco-flavors.zip")
df = df.replace(np.nan, '', regex=True)

for idx, row in df.iterrows():
    ChocoFlavor(
            Company=row['Company\xa0\n(Maker-if known)'],
            BeanOrigin=row['Specific Bean Origin\nor Bar Name'],
            REF=row['REF'],
            ReviewDate=row['Review\nDate'],
            CocoaPrecent=row['Cocoa\nPercent'],
            Location=row['Company\nLocation'],
            Rating=row['Rating'],
            BeanType=row['Bean\nType'],
            BroadBeanOrigin=row['Broad Bean\nOrigin']
    )
show(ChocoFlavor)
commit()

class ChocoFlavor(Entity):
    id = PrimaryKey(int, auto=True)
    Company = Optional(str, default='')
    BeanOrigin = Required(str)
    REF = Required(int)
    ReviewDate = Required(int)
    CocoaPrecent = Required(str)
    Location = Required(str)
    Rating = Required(float)
    BeanType = Optional(str, default='')
    BroadBeanOrigin = Optional(str, default='')


In [0]:
# number of reviews and average rating per year
"""
    SELECT "c"."ReviewDate", COUNT(DISTINCT "c"."id"), AVG("c"."Rating")
    FROM "ChocoFlavor" "c"
    GROUP BY "c"."ReviewDate"
"""
list(select((c.ReviewDate, count(c), avg(c.Rating)) for c in ChocoFlavor))

# number of reviews and average rating per company per year
"""
    SELECT "c"."Company", "c"."ReviewDate", COUNT(DISTINCT "c"."id"), AVG("c"."Rating")
    FROM "ChocoFlavor" "c"
    GROUP BY "c"."Company", "c"."ReviewDate"
"""
list(select((c.Company, c.ReviewDate, count(c), avg(c.Rating)) for c in ChocoFlavor))

SELECT "c"."ReviewDate", COUNT(DISTINCT "c"."id"), AVG("c"."Rating")
FROM "ChocoFlavor" "c"
GROUP BY "c"."ReviewDate"

SELECT "c"."Company", "c"."ReviewDate", COUNT(DISTINCT "c"."id"), AVG("c"."Rating")
FROM "ChocoFlavor" "c"
GROUP BY "c"."Company", "c"."ReviewDate"



[('A. Morin', 2012, 2, 3.625),
 ('A. Morin', 2013, 11, 3.3181818181818183),
 ('A. Morin', 2014, 5, 3.5),
 ('A. Morin', 2015, 4, 3.1875),
 ('A. Morin', 2016, 1, 3.75),
 ('AMMA', 2010, 4, 3.5625),
 ('AMMA', 2013, 1, 3.25),
 ('Acalli', 2015, 2, 3.75),
 ('Adi', 2011, 4, 3.25),
 ('Aequare (Gianduja)', 2009, 2, 2.875),
 ('Ah Cacao', 2009, 1, 3.0),
 ("Akesson's (Pralus)", 2010, 2, 2.75),
 ("Akesson's (Pralus)", 2011, 1, 3.75),
 ('Alain Ducasse', 2013, 2, 2.5),
 ('Alain Ducasse', 2014, 3, 2.8333333333333335),
 ('Alexandre', 2017, 4, 3.5),
 ('Altus aka Cao Artisan', 2013, 5, 3.0),
 ('Altus aka Cao Artisan', 2016, 5, 2.7),
 ('Amano', 2007, 3, 3.4166666666666665),
 ('Amano', 2008, 1, 2.75),
 ('Amano', 2009, 1, 3.0),
 ('Amano', 2010, 3, 3.5833333333333335),
 ('Amano', 2011, 1, 4.0),
 ('Amatller (Simon Coll)', 2009, 4, 2.875),
 ('Amazona', 2013, 2, 3.375),
 ('Ambrosia', 2015, 6, 3.25),
 ('Amedei', 2006, 2, 4.5),
 ('Amedei', 2007, 10, 3.725),
 ('Amedei', 2012, 1, 3.75),
 ('Anahata', 2014, 1, 3.0),
 

In [0]:
! rm -rf choco_flavors


# 3. Kickstarter Projects Dataset (25pt)

Using the [Kickstarter Projects Dataset](https://www.kaggle.com/kemical/kickstarter-projects#ks-projects-201801.csv), answer the following questions:

**Task 1 (for everyone):** Load the dataset into an SQLite DB using [PonyORM](https://ponyorm.org) (10pt)

In [0]:
!kaggle datasets list -s kickstarter-projects
!kaggle datasets download kemical/kickstarter-projects -p ./datasets/kickstarter-projects/
!unzip ./datasets/kickstarter-projects/*.zip  -d ./datasets/kickstarter-projects/

ref                                              title                                        size  lastUpdated          downloadCount  
-----------------------------------------------  ------------------------------------------  -----  -------------------  -------------  
kemical/kickstarter-projects                     Kickstarter Projects                         37MB  2018-02-08 09:02:30          35064  
codename007/funding-successful-projects          Funding Successful Projects on Kickstarter   20MB  2017-06-20 17:37:38           2419  
socathie/kickstarter-project-statistics          Kickstarter Project Statistics                1MB  2019-11-14 06:38:31           5336  
toshimelonhead/400000-kickstarter-projects       400,000 Kickstarter Projects                   0B  2019-07-23 01:23:31            145  
uysalah/archived-kickstarter-projects            Archived Kickstarter Projects                 1MB  2019-05-10 04:33:22            125  
yashkantharia/kickstarter-campaigns      

In [0]:
! rm -rf /content/datasets/kickstarter-projects/kickstarter-project.pony.db

In [0]:
from pony.orm import *
# Creating a new database
db_ks = Database()
db_ks.bind(provider='sqlite', filename='/content/datasets/kickstarter-projects/kickstarter-project.pony.db', create_db=True)

class KickstarterProject(db_ks.Entity):
    name = Optional(str) 
    category = Required(str)
    main_category = Required(str)
    currency = Required(str)
    deadline = Required(str)
    goal = Required(float)
    pledged = Required(float)
    launched = Required(str)
    state = Required(str)
    backers = Required(int)
    country = Required(str)
    usd_pledged = Optional(float)
    usd_pledged_real = Required(float)
    usd_goal_real = Required(float)

show(KickstarterProject)
#set_sql_debug(True) # helps to see what SQL commands are running
db_ks.generate_mapping(create_tables=True) # create tables

class KickstarterProject(Entity):
    id = PrimaryKey(int, auto=True)
    name = Optional(str, default='')
    category = Required(str)
    main_category = Required(str)
    currency = Required(str)
    deadline = Required(str)
    goal = Required(float)
    pledged = Required(float)
    launched = Required(str)
    state = Required(str)
    backers = Required(int)
    country = Required(str)
    usd_pledged = Optional(float)
    usd_pledged_real = Required(float)
    usd_goal_real = Required(float)


In [0]:
import pandas
import numpy as np

df_ks = pandas.read_csv("./datasets/kickstarter-projects/ks-projects-201801.csv")
# assign the result back instead of using inplace=True on a column selection
df_ks['name'] = df_ks['name'].fillna("")
df_ks['usd pledged'] = df_ks['usd pledged'].fillna(0.0)

for idx, row in df_ks.iterrows():
    KickstarterProject(
            name=(row['name'] if not type(row['name']) is float else ""),
            category=row['category'],
            main_category=row['main_category'],
            currency=row['currency'],
            deadline=row['deadline'],
            goal=row['goal'],
            launched=row['launched'],
            pledged=row['pledged'],
            state=row['state'],
            backers=row['backers'],
            country=row['country'],
            usd_pledged=row['usd pledged'],
            usd_pledged_real=row['usd_pledged_real'],
            usd_goal_real=row['usd_goal_real'],
    )
commit()

### <span style="color:red"> Please answer only **one** of the following questions according to your ID number (use the formula **<YOUR_ID> mod 3 +1**) </span>

In [0]:
# which question to answer - put your ID number and run the code 
your_id  = "316460443"
q = int(your_id) % 3 + 1
print("You need to answer question number %s" % q)

You need to answer question number 2


**Question 1:** On average which project category received the highest number of backers? (15 pt)


**Question 2:** On average which project category received the highest pledged USD? (15 pt)

In [0]:
"""
    SELECT "p"."category", AVG("p"."usd_pledged")
    FROM "KickstarterProject" "p"
    GROUP BY "p"."category"
"""
avg_pledge = select((p.category, avg(p.usd_pledged)) for p in KickstarterProject)

"""
    ORDER BY AVG("p"."usd_pledged") DESC
    LIMIT 1
"""
desc_avg_pledge = avg_pledge.order_by(lambda x,y: desc(y)).limit(1)
list(desc_avg_pledge)

SELECT "p"."category", AVG("p"."usd_pledged")
FROM "KickstarterProject" "p"
GROUP BY "p"."category"
ORDER BY AVG("p"."usd_pledged") DESC
LIMIT 1



[('3D Printing', 52027.158213762814)]

**Question 3:** Which month had the highest number of projects? (15 pt)
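
One possible approach to the month question is to extract the month with SQLite's `strftime` and group on it. The sketch below assumes the launch timestamp is stored as text in `YYYY-MM-DD HH:MM:SS` form, which is an assumption about the data; the table `projects` and its rows are hypothetical.

```python
import sqlite3

# Hypothetical table; timestamps are assumed to look like 'YYYY-MM-DD HH:MM:SS'
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE projects (name text, launched text)")
demo.executemany("INSERT INTO projects VALUES (?,?)", [
    ("a", "2015-08-11 12:12:28"), ("b", "2016-08-02 09:00:00"),
    ("c", "2017-03-15 18:30:00"), ("d", "2014-08-30 23:59:59"),
    ("e", "2016-03-01 00:00:01"),
])

# strftime('%m', ...) returns the two-digit month, so projects from all years are pooled
busiest_month = demo.execute(
    """SELECT strftime('%m', launched) AS month, COUNT(*) AS n_projects
       FROM projects
       GROUP BY month
       ORDER BY n_projects DESC
       LIMIT 1"""
).fetchone()
print(busiest_month)
```

Grouping on `'%Y-%m'` instead of `'%m'` would answer a different question: the busiest single calendar month rather than the busiest month of the year.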

# 4. Oscars Datasets (10pt)

Using the [Oscars Dataset](https://www.kaggle.com/theacademy/academy-awards), please answer only one of the following questions (you can choose):

**Question 1:** Which actress has the most Oscar nominations? (10pt)

**Question 2:** Which male director has the most Oscar nominations? (10pt)

In [4]:
# download and unzip dataset
!kaggle datasets list -s academy-awards
!kaggle datasets download theacademy/academy-awards -p ./datasets/academy-awards/
!unzip ./datasets/academy-awards/academy-awards.zip -d ./datasets/academy-awards/

ref                                                     title                                                size  lastUpdated          downloadCount  
------------------------------------------------------  --------------------------------------------------  -----  -------------------  -------------  
theacademy/academy-awards                               The Academy Awards, 1927-2015                       185KB  2017-02-13 17:30:48           4881  
fmejia21/demographics-of-academy-awards-oscars-winners  Demographics of Academy Awards (Oscars) Winners      20KB  2020-02-04 17:38:26           2138  
unanimad/golden-globe-awards                            Golden Globe Awards, 1944 - 2020                    117KB  2020-01-06 16:19:01           1551  
unanimad/the-oscar-award                                The Oscar Award, 1927 - 2020                        191KB  2020-02-19 15:45:30            552  
madhurinani/oscars-2017-tweets                          2017 #Oscars Tweets             

In [0]:
! rm -rf /content/datasets/academy-awards/academy-awards.pony.db

In [6]:
from pony.orm import *
import pandas

# Creating a new database
db_aa = Database()
db_aa.bind(provider='sqlite', filename='/content/datasets/academy-awards/academy-awards.pony.db', create_db=True)

class AcademyAward(db_aa.Entity):
    Year = Required(str) 
    Ceremony = Required(int)
    Award = Required(str)
    Winner = Optional(float)
    Name = Required(str)
    Film = Optional(str)

show(AcademyAward)
#set_sql_debug(True)
db_aa.generate_mapping(create_tables=True)

class AcademyAward(Entity):
    id = PrimaryKey(int, auto=True)
    Year = Required(str)
    Ceremony = Required(int)
    Award = Required(str)
    Winner = Optional(float)
    Name = Required(str)
    Film = Optional(str, default='')


In [0]:
import pandas
import numpy as np

df_aa = pandas.read_csv("./datasets/academy-awards/database.csv")
df_aa['Winner'] = df_aa['Winner'].fillna(0.0)

for idx, row in df_aa.iterrows():
    AcademyAward(
            Year=row['Year'],
            Ceremony=row['Ceremony'],
            Award=row['Award'],
            Winner=row['Winner'],
            Name=row['Name'],
            Film=(row['Film'] if not type(row['Film']) is float else "")
    )
commit()

In [10]:
"""
    SELECT "a"."Film", COUNT(DISTINCT "a"."id")
    FROM "AcademyAward" "a"
    WHERE "a"."Award" = 'Directing'
    GROUP BY "a"."Film"
    ORDER BY COUNT(DISTINCT "a"."id") DESC
    LIMIT 1
"""
# The query is supposed to use GROUP BY "a"."Name", since we want to group rows belonging to the same person.
# But there is an error in the dataset: in some rows the values of 'Film' and 'Name' are swapped,
# hence, in order to get the correct answer, the query needs to group by "a"."Film".
most_award = select((a.Film, count(a)) for a in AcademyAward).where(lambda a: a.Award == 'Directing')
most_award_desc = most_award.order_by(lambda x,y: desc(y)).limit(1)
list(most_award_desc)

[('William Wyler', 12)]

**Question 3:** Which 10 movies received the highest number of Oscar nominations? (10pt)

**Question 4:** Write a function that receives an actor's name and returns the actor’s number of Oscar nominations. Use the function to calculate the number of times Leonardo DiCaprio was nominated (10pt)
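
Because the person's name sometimes appears in the `Film` column instead of `Name` (see the note in the directing query above), a counting function for Question 4 may want to check both columns. This is a minimal sketch using plain `sqlite3` on a hypothetical in-memory table; the rows use placeholder names and `count_nominations` is an invented helper.

```python
import sqlite3

# Hypothetical table shaped like AcademyAward, filled with placeholder rows
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE AcademyAward (Year text, Award text, Winner real, Name text, Film text)")
demo.executemany("INSERT INTO AcademyAward VALUES (?,?,?,?,?)", [
    ("1990", "Actor", 0.0, "Actor A", "Film X"),
    ("1995", "Actor", 1.0, "Film Y", "Actor A"),   # Name and Film swapped, as in the dataset
    ("1995", "Actress", 0.0, "Actor B", "Film Y"),
])

def count_nominations(connection, person):
    """Count rows that mention the person in either the Name or the Film column."""
    return connection.execute(
        "SELECT COUNT(*) FROM AcademyAward WHERE Name = ? OR Film = ?",
        (person, person),
    ).fetchone()[0]

print(count_nominations(demo, "Actor A"))
```

Passing the name as a bound parameter rather than formatting it into the SQL string also keeps names containing apostrophes from breaking the query.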

# 5. Select a Dataset (15pt)

**Open Question:** Select an interesting dataset and use SQL to discover something interesting (15pt). **Bonus:** Use BigQuery (2pt)

In [0]:
# download and unzip dataset
!kaggle datasets download shivamb/netflix-shows -p ./datasets/netflix-shows/
!unzip ./datasets/netflix-shows/netflix-shows.zip -d ./datasets/netflix-shows/

netflix-shows.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  ./datasets/netflix-shows/netflix-shows.zip
replace ./datasets/netflix-shows/netflix_titles.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: מ
error:  invalid response [מ]
replace ./datasets/netflix-shows/netflix_titles.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: מ
error:  invalid response [מ]
replace ./datasets/netflix-shows/netflix_titles.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n


In [0]:
import sqlite3
import os

TEXT_PATH_NS = './datasets/netflix-shows/netflix_titles.csv'
DB_PATH_NS = './datasets/netflix-shows/netflix-shows.sqlite'

In [0]:
import pandas

conn_ns = sqlite3.connect(DB_PATH_NS) # connecting to the database
c_ns = conn_ns.cursor() # creating a cursor object

c_ns.execute(
        '''CREATE TABLE IF NOT EXISTS NetflixShow
             ([show_id] integer,
              [type] text,
              [title] text,
              [director] text,
              [cast] text,
              [country] text,
              [date_added] text,
              [release_year] integer,
              [rating] text,
              [duration] text,
              [listed_in] text,
              [description] text
              )
        '''
)

df_ns = pandas.read_csv("./datasets/netflix-shows/netflix-shows.zip")
# load the data into a convenient format: one tuple per row
shows = [tuple(dict(row).values()) for idx, row in df_ns.iterrows()]

# insert data to db
c_ns.executemany(
    '''INSERT INTO NetflixShow(show_id, type, title, director, cast, country,
                               date_added, release_year, rating, duration, listed_in,
                               description) values (?,?,?,?,?,?,?,?,?,?,?,?)
    ''', shows
)

<sqlite3.Cursor at 0x7f87c8ed2e30>

In [0]:
# The database contains the titles (movies and series) added to Netflix (2008-2020).
# For each title we can find the year it was released and the date it was added
# to Netflix.
# My guess is that we will see massive growth in the amount of content added each year due to
# today's consumption culture and the 'Netflix revolution' we all know.
c_ns.execute(
    ''' SELECT SUBSTR(date_added, -4), COUNT(*)
        FROM NetflixShow
        GROUP BY SUBSTR(date_added, -4)
        ORDER BY 1 ASC
    '''
).fetchall()

[(None, 11),
 ('2008', 2),
 ('2009', 2),
 ('2010', 1),
 ('2011', 13),
 ('2012', 7),
 ('2013', 12),
 ('2014', 25),
 ('2015', 90),
 ('2016', 456),
 ('2017', 1300),
 ('2018', 1782),
 ('2019', 2349),
 ('2020', 184)]
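
The `(None, 11)` row comes from titles with no `date_added` value, and the years come back as text. A small variation of the query can drop the missing dates and return integer years, which is more convenient for plotting. The sketch below runs on a hypothetical in-memory table; the sample rows and the assumed date format (`Month D, YYYY`) are illustrative only.

```python
import sqlite3

# Hypothetical rows; date_added is assumed to end with a four-digit year
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE NetflixShow (title text, date_added text)")
demo.executemany("INSERT INTO NetflixShow VALUES (?,?)", [
    ("a", "September 9, 2019"), ("b", "January 1, 2019"),
    ("c", "December 20, 2018"), ("d", None),
])

# SUBSTR with a negative start counts from the end of the string
per_year = demo.execute(
    """SELECT CAST(SUBSTR(date_added, -4) AS INTEGER) AS year_added, COUNT(*)
       FROM NetflixShow
       WHERE date_added IS NOT NULL
       GROUP BY year_added
       ORDER BY year_added"""
).fetchall()
print(per_year)
```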