## Problem 2

Millions of searches happen on modern search engines like Google. Advertisers want to know about search interests in order to target consumers effectively. In this notebook, we will look at "search interest scores" for the 2016 Olympics obtained from [Google Trends](https://trends.google.com/trends/).

This problem is divided into four (4) exercises, numbered 0-3. They are worth a total of ten (10) points.

> By way of background, a search interest score is computed by region and normalized by population size to account for population differences between regions. You can read more about search interest [here](https://medium.com/google-news-lab/what-is-google-trends-data-and-what-does-it-mean-b48f07342ee8).

In [1]:
# Some modules and functions we'll need

import pandas as pd
#!pip install pysqlite
import sqlite3
from IPython.display import display

def canonicalize_tibble(X):
    """Returns a tibble in _canonical order_."""
    # Enforce Property 1:
    var_names = sorted(X.columns)
    Y = X[var_names].copy()

    # Enforce Property 2:
    Y.sort_values(by=var_names, inplace=True)

    # Enforce Property 3:
    Y.set_index([list(range(0, len(Y)))], inplace=True)

    return Y

def tibbles_are_equivalent(A, B):
    """Given two tidy tables ('tibbles'), returns True iff they are
    equivalent.
    """
    A_canonical = canonicalize_tibble(A)
    B_canonical = canonicalize_tibble(B)
    cmp = A_canonical.eq(B_canonical)
    return cmp.all().all()

## The data

We will be working with two sources of data.

The first is the [search interest data taken from Google Trends](https://raw.githubusercontent.com/googletrends/data/master/20160819_OlympicSportsByCountries.csv).

The second is [world population data taken from the U.S. Census Bureau](https://www.census.gov/population/international/data/idb/).

For your convenience, these data are stored in two tables in a SQLite database stored in a file named `olympics/sports.db`. We will need to read the data into dataframes before proceeding.

**Exercise 0** (2 points). The SQLite database has two tables in it, one named `search_interest` and the other named `countries`. Implement the function, **`read_data(conn)`** below, to read these tables into a pair of Pandas dataframes.

In particular, assume that **`conn`** is an open SQLite database connection object. Your function should return a pair of dataframes, `(search_interest, countries)`, corresponding to these tables. (See the `# Demo code` below.)
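
As a rough illustration of the task described above, the sketch below reads two tables from a throwaway in-memory SQLite database using `pd.read_sql_query`, which keeps the SQL column names. The helper name `read_tables`, the toy schema, and the rows are all assumptions made up for this example; they are not taken from the real `sports.db` file.

```python
import sqlite3
import pandas as pd

def read_tables(conn, table_names):
    """Hypothetical helper: read each named table into a dataframe."""
    return tuple(pd.read_sql_query(f"SELECT * FROM {name}", conn)
                 for name in table_names)

# Toy in-memory database standing in for the real file (made-up schema and rows).
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE search_interest (Country TEXT, Search_Interest INTEGER, Sport TEXT)")
demo_conn.execute("INSERT INTO search_interest VALUES ('Iran', 1, 'Archery')")
demo_conn.execute("CREATE TABLE countries (Country TEXT, Population INTEGER)")
demo_conn.execute("INSERT INTO countries VALUES ('Iran', 100)")
demo_conn.commit()

demo_si, demo_c = read_tables(demo_conn, ["search_interest", "countries"])
demo_conn.close()

print(demo_si)
print(demo_c)
```

Reading through `pd.read_sql_query` rather than a raw cursor means the resulting dataframes carry named columns instead of positional labels such as `0`, `1`, `2`.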

In [2]:
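def read_data(conn):
    """Returns the `search_interest` and `countries` tables as dataframes."""
    c = conn.cursor()
    search_interest = c.execute('SELECT * FROM search_interest').fetchall()
    countries = c.execute('SELECT * FROM countries').fetchall()
    # Note: building the frames from raw rows drops the SQL column names,
    # so the columns end up labeled 0, 1, 2, ...
    return pd.DataFrame(search_interest), pd.DataFrame(countries)

# Demo code: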
#conn = sqlite3.connect('olympics/sports.db')
conn=sqlite3.connect('sports.db')
search_interest, countries = read_data(conn)
conn.close()

print("=== search_interest ===")
display(search_interest.head())

print("=== countries ===")
display(countries.head())

=== search_interest ===


| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0 | Iran | 1 | Archery |
| 1 | 1 | South Korea | 2 | Archery |
| 2 | 2 | Mexico | 1 | Archery |
| 3 | 3 | Netherlands | 1 | Archery |
| 4 | 4 | Aruba | 16 | Artistic gymnastics |


=== countries ===


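| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0 | Reunion | 2016 | 850996 | 2511 | 340.0 |
| 1 | 1 | Martinique | 2016 | 385551 | 128 | 340.0 |
| 2 | 2 | Guadeloupe | 2016 | 402119 | 1628 | 250.0 |
| 3 | 3 | Myanmar | 2016 | 54616716 | 653508 | 83.6 |
| 4 | 4 | CzechRepublic | 2016 | 10660932 | 77247 | 138.0 |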


In [3]:
# Test cell: `read_data_test`

df1 = pd.read_csv("olympics/OlympicSportsByCountries_2016.csv")
df2 = pd.read_csv("olympics/census_data_2016.csv")

try:
    ref = pd.read_csv
    del pd.read_csv
    conn = sqlite3.connect('sports.db')
    search_interest, countries = read_data(conn)
    conn.close()
except AttributeError as e:
    raise RuntimeError("Were you using read_csv to read the CSV solution?")
finally:
    pd.read_csv = ref

print("\n(Passed!)")


(Passed!)


**Exercise 1** (3 points). In this exercise, compute the answers to the following three questions about the `search_interest` data.

1. Which country has the "most varied" interest in Olympic sports? That is, in the dataframe of search interests, which country appears most often? Store the result in the variable named **`top_country`**.
2. Which Olympic sport generates interest in the largest number of countries? Store the result in the variable **`top_sport`**.
3. How many sports are listed in the table? Store the result in the variable **`sport_count`**.

> **Hint**: The [`scipy.stats.mode()`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mode.html) function could be useful in this exercise.
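
The three questions above can also be answered with plain pandas. The sketch below uses a made-up toy dataframe (the column names `Country` and `Sport` and all rows are assumptions for illustration) and relies on `value_counts().idxmax()` to find the most frequent label and `nunique()` to count distinct values.

```python
import pandas as pd

# Made-up data: one row per (country, sport) pair with recorded interest.
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "A", "C"],
    "Sport": ["Archery", "Judo", "Judo", "Rowing", "Judo"],
})

# Country that appears most often, i.e., has interest in the most sports.
toy_top_country = toy["Country"].value_counts().idxmax()

# Sport that appears most often, i.e., draws interest in the most countries.
toy_top_sport = toy["Sport"].value_counts().idxmax()

# Number of distinct sports in the table.
toy_sport_count = toy["Sport"].nunique()

print(toy_top_country, toy_top_sport, toy_sport_count)
```

Counting rows per country only measures "variety" if each country and sport pair appears at most once, which is assumed here.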

In [4]:
#search_interest.head()

In [5]:
top_country = None
top_sport = None
sport_count = None

def compute_basic_stats():
    # Column 1 holds the country and column 3 the sport (see `read_data`).
    # `idxmax` returns the label with the highest count. Calling `np.argmax`
    # on a Series is deprecated for this purpose and is slated to return a
    # position rather than a label.
    top_country = search_interest[1].value_counts().idxmax()
    top_sport = search_interest[3].value_counts().idxmax()
    sport_count = len(search_interest[3].unique())
    return top_country, top_sport, sport_count

top_country, top_sport, sport_count = compute_basic_stats()


In [6]:
# Test code
try:
    ref = search_interest
    del search_interest
    top_country, top_sport, sport_count = compute_basic_stats()
except NameError:
    search_interest = ref
    top_country, top_sport, sport_count = compute_basic_stats()
    assert top_country == 'Croatia' or top_country == 'New Zealand'
    assert top_sport == 'Athletics (Track & Field)'
    assert sport_count == 34
except Exception as e:
    print(e)
    print("Were you not using the search_interest dataframe to compute the stats?")
finally:
    search_interest = ref

print("\n(Passed!)")


(Passed!)




In [7]:
#countries[[1,3]].head()

## Worldwide popularity of a sport

To estimate the popularity of a sport, it is not enough to count the countries where the sport generated search interest. We might get a better estimate by computing a weighted average of search interest that accounts for differences in both search interest and population among countries.

**Exercise 2** (2 points). Before we can compute a weighted average, we need to find the weight for each country. To do that, we need the population of each country in the search interest table, which we can obtain by querying the census population table.

Complete the function **`join_pop(si, c)`** below to perform this task. That is, given the dataframe of search interests, **`si`**, and the census data, **`c`**, this function should join the `Population` column from `c` to `si` and return the result.

The returned value of `join_pop(si, c)` should be a copy of `si` with one additional column named `'Population'` that holds the corresponding population value from `c`.

> To match the country names between the `si` and `c` dataframes, note that the `si` dataframe's `'Country'` column includes spaces whereas `c` does not. You'll want to account for that by, for instance, stripping out the spaces from `si` before merging or joining with `c`.
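
A minimal sketch of the name-matching idea in the note above, on made-up data. The column names, the fictional country `Atlantis`, and the population figures are assumptions for illustration only. It uses a left merge with `indicator=True` so that rows without a census match remain visible instead of silently disappearing.

```python
import pandas as pd

# Made-up search-interest rows; "Atlantis" has no census entry on purpose.
toy_si = pd.DataFrame({
    "Country": ["South Korea", "Mexico", "Atlantis"],
    "Sport": ["Archery", "Archery", "Diving"],
})

# Made-up census rows; names carry no spaces, as in the problem statement.
toy_c = pd.DataFrame({
    "Country": ["SouthKorea", "Mexico"],
    "Population": [50, 120],
})

# Strip spaces on a copy so the original frame is left untouched.
left = toy_si.assign(Country=toy_si["Country"].str.replace(" ", "", regex=False))

# Left join keeps every search-interest row; `validate` checks that each
# country name occurs at most once in the census table.
joined = left.merge(toy_c[["Country", "Population"]], on="Country",
                    how="left", validate="many_to_one", indicator=True)

# Names that found no match in the census table.
unmatched = joined.loc[joined["_merge"] == "left_only", "Country"].tolist()
print(joined)
print(unmatched)
```

Whether unmatched countries should be dropped (an inner join) or kept with a missing population (a left join) is a design choice; the exercise statement does not say which one it expects.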

In [8]:
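def translate_country(country):
    """Removes spaces from a country name."""
    return country.replace(' ', '')

def join_pop(si, c):
    join_df = si.copy()
    # Strip spaces so names match the census table ('South Korea' -> 'SouthKorea').
    join_df[1] = join_df[1].apply(translate_country)
    # Keep only the country name (column 1) and population (column 3).
    c = c[[1, 3]]
    # Inner join: countries missing from the census table are dropped.
    join_df = join_df.merge(c, on=1)
    join_df.columns = ['index', 'Country', 'Search_Interest', 'Sport', 'Population']
    join_df = join_df.sort_values(by=['Sport', 'index'])
    join_df.index = range(join_df.shape[0])
    return join_df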

total_world_population = sum(countries[3])
join_df = join_pop(search_interest, countries)

display(join_df.head())

| | index | Country | Search_Interest | Sport | Population |
|---|---|---|---|---|---|
| 0 | 0 | Iran | 1 | Archery | 80987449 |
| 1 | 1 | SouthKorea | 2 | Archery | 50924172 |
| 2 | 2 | Mexico | 1 | Archery | 123166749 |
| 3 | 3 | Netherlands | 1 | Archery | 17016967 |
| 4 | 4 | Aruba | 16 | Artistic gymnastics | 113648 |


**Weighting search interest by population.** Suppose that to compare different Olympic sports by global popularity, we want to account for each country's population.

For instance, suppose we are looking at the global search interest in volleyball. If volleyball's search interest equals `1` in both China and the Netherlands, we might weight China's search interest more heavily since it is the more populous country.

To determine the weights for each country, let's just use each country's fraction of the global population. Recall that an earlier code cell computed the variable, `total_world_population`, which is the global population. Let the weight of a given country be its population divided by the global population. (For instance, if the global population is 6 billion people and the population of India is 1 billion, then India's "weight" would be one-sixth.)
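
The weight definition above can be checked on the round numbers from the parenthetical example. In this sketch the label `Rest of world` and the variable names are assumptions; the two figures are the illustrative 1 billion and 6 billion from the text, not real census values.

```python
import pandas as pd

# Illustrative round numbers from the text: 1 billion out of 6 billion.
toy_pop = pd.Series({"India": 1_000_000_000, "Rest of world": 5_000_000_000})

# A country's weight is its share of the total population.
toy_weights = toy_pop / toy_pop.sum()

print(toy_weights)
```

By construction the weights sum to one, so they can be used directly as coefficients in a weighted sum.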

**Exercise 3** (3 points). Create a dataframe named `ranking` with two columns, `'Sport'` and `'weighted_interest'`, where there is one row per sport, and each sport's `'weighted_interest'` is the overall weighted interest across countries, using the population weights for each country as described above.

> **Hint**: Consider using [groupby()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) for Pandas DataFrames. It is very similar to `GROUP BY` in SQL.
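
To make the weighted sum concrete, here is a small sketch on made-up data. It assumes a toy "world" of two countries, `A` and `B`, with invented populations and interest scores; only the column names `Sport` and `weighted_interest` come from the exercise statement.

```python
import pandas as pd

# Made-up joined table: one row per (country, sport) pair.
toy = pd.DataFrame({
    "Country": ["A", "B", "A"],
    "Sport": ["Judo", "Judo", "Rowing"],
    "Search_Interest": [2, 4, 10],
    "Population": [30, 10, 30],
})

# Assumption: countries A and B make up the entire toy world.
toy_total = 40

# Per-row contribution: (population share) * (search interest).
toy["weighted_interest"] = toy["Search_Interest"] * toy["Population"] / toy_total

# Sum the contributions per sport and rank from highest to lowest.
toy_ranking = (toy.groupby("Sport", as_index=False)["weighted_interest"].sum()
                  .sort_values("weighted_interest", ascending=False)
                  .reset_index(drop=True))

print(toy_ranking)
```

Here Rowing scores 10 × 30/40 = 7.5, while Judo scores 2 × 30/40 + 4 × 10/40 = 2.5, so Rowing ranks first even though it has interest in only one country.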

In [9]:

# Each country's weight is its share of the total population.
join_df["fraction"] = join_df["Population"] / total_world_population
join_df["weighted_interest"] = join_df["fraction"] * join_df["Search_Interest"]

# Sum the weighted interest per sport and rank from most to least popular.
ranking = (join_df.groupby("Sport")["weighted_interest"].sum()
           .sort_values(ascending=False)
           .reset_index())

# Top 10 sports
display(ranking[:10])

| | Sport | weighted_interest |
|---|---|---|
| 0 | Swimming | 5.983388 |
| 1 | Athletics (Track & Field) | 4.273728 |
| 2 | Badminton | 3.051064 |
| 3 | Artistic gymnastics | 2.337363 |
| 4 | Tennis | 2.119308 |
| 5 | Football (Soccer) | 1.345433 |
| 6 | Table tennis | 0.929301 |
| 7 | Wrestling | 0.845934 |
| 8 | Diving | 0.72784 |
| 9 | Basketball | 0.462788 |


In [10]:
# Test cell: `ranking_test`

ranking_ref = pd.read_csv("olympics/rankings_ref.csv")
assert (ranking_ref["Sport"] == ranking["Sport"]).all()

print("\n(Passed!)")


(Passed!)


**Fin!** You have reached the end of this problem. Be sure to submit it before moving on.