# HW2 - Data Ingestion and Analysis

# Data Ingestion

#### Import Packages

In [44]:
import pandas as pd
from tqdm import tqdm
import json, time, urllib.parse
import requests

#### Get US cities and related articles data

In [45]:
cities_wiki_df = pd.read_csv('us_cities_by_state_SEPT.2023.csv')

### Page Info Extraction using API

Extract the page info for every article with calls to the Wikipedia API. The API call and the constants below are taken from the example code.

In [46]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<uwnetid@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2023',
}

# This is just a list of English Wikipedia article titles that we can use for example requests
ARTICLE_TITLES = [ 'Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat' ]

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}


The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article pages. Therefore the parameter most likely to change is the article_title.

In [47]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    
    # article title can be as a parameter to the call or in the request_template
    if article_title:
        request_template['titles'] = article_title

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


Create a function that gets the page info for one article title. Each row's article title is passed to this function, which stores the article's information, including its revision ID, in a shared dictionary that ends up covering all 22,157 articles.

In [48]:
# Function Definition
def get_page_info(df):
    article_title = df['page_title']
    try:
        info = request_pageinfo_per_article(article_title)
        page_dict = info['query']['pages']
        all_page_info_dict.update(page_dict)
    except:
        no_page_info_list.append(article_title)
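
For reference, a minimal sketch of the response shape that `get_page_info` relies on and of how entries keyed by page ID accumulate in one dictionary. The responses below are made up for illustration (page IDs, titles, and revision IDs are not real data). The sketch also assumes that the API reports a title it cannot find under the key `-1` with a `missing` field instead of raising an error, so such entries may be worth filtering out.

```python
# Made-up responses shaped like MediaWiki "action=query&prop=info" results.
response_found = {
    "query": {"pages": {"1001": {"pageid": 1001, "ns": 0,
                                 "title": "Example City, Alabama",
                                 "lastrevid": 111111}}}
}
# Assumed shape for a title that does not exist: key "-1" plus a "missing" field.
response_missing = {
    "query": {"pages": {"-1": {"ns": 0, "title": "No Such Page", "missing": ""}}}
}

all_pages = {}
missing_titles = []
for response in (response_found, response_missing):
    for page_key, page in response["query"]["pages"].items():
        if "missing" in page:
            # Keep track of titles the API could not resolve
            missing_titles.append(page["title"])
        else:
            all_pages[page_key] = page

# Map each resolved title to its latest revision ID
revision_ids = {page["title"]: page["lastrevid"] for page in all_pages.values()}
print(revision_ids)
print(missing_titles)
```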

Create an empty dictionary and fill it with the page info returned for each article. Titles whose request fails are collected in a separate list.

In [49]:
# Function Call
all_page_info_dict = dict()
no_page_info_list = []
x = cities_wiki_df.apply(get_page_info,axis=1)

Convert the final dictionary with all the dicts into a dataframe and save it for further use

In [50]:
page_info_df = pd.DataFrame(all_page_info_dict).T.reset_index(drop=True)
page_info_df.to_csv('page_info_df.csv')
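
As a side note on the conversion above, a small sketch of two ways to turn a dictionary of per-page dictionaries into one row per page. The data is made up. Transposing keeps every column as `object`, while `DataFrame.from_dict(..., orient="index")` lets pandas infer a dtype per column, which can matter when `lastrevid` is later used as a merge key.

```python
import pandas as pd

# Made-up page info keyed by page ID, shaped like the accumulated dictionary
pages = {
    "1001": {"pageid": 1001, "title": "Example City, Alabama", "lastrevid": 111111},
    "1002": {"pageid": 1002, "title": "Sample Town, Alaska", "lastrevid": 222222},
}

# Approach used in the notebook: pages become columns, then transpose to rows
transposed = pd.DataFrame(pages).T.reset_index(drop=True)

# Alternative: treat each outer key as a row label directly
from_index = pd.DataFrame.from_dict(pages, orient="index").reset_index(drop=True)

print(transposed.dtypes)
print(from_index.dtypes)
```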

## ORES Score extraction using API

Create the final page info dataframe with the article title and its current revision ID, which is used for the ORES score pull.

In [58]:
page_info_df_final = page_info_df[['pageid','title','lastrevid']]

Extract the quality score for every article with calls to the ORES API. The API call and the constants below are taken from the example code.

In [59]:
#########
#
#    CONSTANTS
#

#    The current LiftWing ORES API endpoint and prediction model
#
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

#
#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token. An example of that is below.
#
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = ((60.0*60.0)/5000.0)-API_LATENCY_ASSUMED  # The access token grants 5000 requests per hour

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there
#    
#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too
#
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "<nshah23@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2023",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}
#
#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "",         # your email address should go here
    'access_token'  : ""          # the access token you create will need to go here
}

#
#    A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring example
#
ARTICLE_REVISIONS = { 'Bison':1085687913 , 'Northern flicker':1086582504 , 'Red squirrel':1083787665 , 'Chinook salmon':1085406228 , 'Horseshoe bat':1060601936 }

#
#    This is a template of the data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # required that its english - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

#
#    These are used later - defined here so they, at least, have empty values
#
USERNAME = ""
ACCESS_TOKEN = ""
#
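
The comment above says the rate limits come from dissecting the access token. A minimal sketch of that step, assuming the token is a JWT whose payload carries a `ratelimit` entry with `requests_per_unit` and `unit` fields. The token here is built in the code from made-up values, the function name `decode_jwt_payload` is hypothetical, and the signature is not verified.

```python
import base64
import json

def decode_jwt_payload(token):
    """Decode the middle (payload) segment of a JWT without verifying it."""
    payload_b64 = token.split(".")[1]
    # JWT segments drop base64 padding, so restore it before decoding
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))

def _encode_segment(obj):
    # Helper used only to build a made-up token for this example
    raw = json.dumps(obj).encode("utf-8")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")

example_token = ".".join([
    _encode_segment({"typ": "JWT", "alg": "none"}),
    _encode_segment({"ratelimit": {"requests_per_unit": 5000, "unit": "HOUR"}}),
    "not-a-real-signature",
])

limit = decode_jwt_payload(example_token)["ratelimit"]
seconds_per_unit = {"HOUR": 60.0 * 60.0}[limit["unit"]]  # only HOUR is handled here
throttle_wait = seconds_per_unit / limit["requests_per_unit"]
print(limit, throttle_wait)
```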

Set the username and the access token that I created with my Wikimedia account.

In [60]:
USERNAME = "Nshah23"
# Read the token from an environment variable so the secret is not stored in the notebook.
# The variable name WIKIMEDIA_ACCESS_TOKEN is an assumption; use whatever name you set.
import os
ACCESS_TOKEN = os.environ.get("WIKIMEDIA_ACCESS_TOKEN", "")

Function to request scores for all articles based on revision ID.

In [16]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT, 
                                   model_name = API_ORES_EN_QUALITY_MODEL, 
                                   request_data = ORES_REQUEST_DATA_TEMPLATE, 
                                   header_format = REQUEST_HEADER_TEMPLATE, 
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):
    
    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token
    
    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")
    
    # Create the request URL with the specified model parameter - default is an article quality score request
    request_url = endpoint_url.format(model_name=model_name)
    
    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
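
To make the header construction in the function concrete, a short sketch of how `str.format` fills a header template from a parameter dictionary. The template and values below are made up (the real template keeps a fixed email in `User-Agent`); placeholders that a template string does not contain are simply ignored.

```python
# Made-up template and parameters mirroring the structure used above
header_format = {
    "User-Agent": "<{email_address}>, Example University",
    "Content-Type": "application/json",
    "Authorization": "Bearer {access_token}",
}
header_params = {
    "email_address": "someone@example.org",
    "access_token": "made-up-token",
}

# Each template value is formatted with all parameters; unused ones are ignored
headers = {key: value.format(**header_params) for key, value in header_format.items()}
print(headers)
```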


Define a function that gets the quality score (prediction) for each article based on its current revision ID. Each row is passed to this function through apply, which calls the ORES API and stores the returned score in a dictionary under the article's revision ID.

In [55]:
# Function Definition
def get_ores(df):
    try:
        score = request_ores_score_per_article(article_revid=df['lastrevid'],
                                                   email_address="nshah23@uw.edu",
                                                   access_token=ACCESS_TOKEN)
        ores_dict[df['lastrevid']]=score['enwiki']['scores'][str(df['lastrevid'])]['articlequality']['score']
    except:
        noscore_list.append(df['title'])
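
A minimal sketch of the lookup that `get_ores` performs on the response, written so that only the expected failure modes are caught. The response below is made up but follows the key path used above; the helper name `extract_prediction` is hypothetical.

```python
# Made-up response following the key path used in get_ores
sample_response = {
    "enwiki": {
        "models": {"articlequality": {"version": "0.0.0"}},
        "scores": {
            "111111": {
                "articlequality": {
                    "score": {
                        "prediction": "GA",
                        "probability": {"B": 0.1, "C": 0.1, "FA": 0.1,
                                        "GA": 0.5, "Start": 0.1, "Stub": 0.1},
                    }
                }
            }
        },
    }
}

def extract_prediction(response, rev_id):
    """Return the predicted quality class, or None when the path is absent."""
    try:
        score = response["enwiki"]["scores"][str(rev_id)]["articlequality"]["score"]
        return score["prediction"]
    except (KeyError, TypeError):
        # KeyError: unexpected payload; TypeError: response is None
        return None

print(extract_prediction(sample_response, 111111))
print(extract_prediction(None, 111111))
```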

Call the function on each row (article title and revision ID) to get the corresponding score dictionary. The score dictionary generated for each article is added to an initially empty dictionary, so the final dictionary holds the prediction and scores keyed by revision ID.

In [56]:
# Function Call
ores_dict = dict()
noscore_list = []
y = page_info_df_final.apply(get_ores, axis=1)

Save the dictionary into a dataframe and change the column names as needed for the later steps.

In [57]:
ores_df = pd.DataFrame(ores_dict).T
ores_df = ores_df.reset_index()
ores_df.rename(columns={'index':'lastrevid'},inplace=True)
ores_df = ores_df[['lastrevid','prediction']]

# Data Merge

Getting all the dataframes and merging them for the final analysis

In [3]:
page_info_df = page_info_df[['title','lastrevid']]

# Merging the wikipedia and ores df on lastrevid
wiki_ores_df = pd.merge(page_info_df,ores_df,on="lastrevid",how="inner")

us_cities_df = pd.read_csv('us_cities_by_state_SEPT.2023.csv')

# Merging us_cities_by_state_SEPT.2023 df with wiki_ores_df on article title to get the revision ID.
wiki_final = pd.merge(us_cities_df,wiki_ores_df,left_on='page_title',right_on='title',how="left")

#Get Population df - Estimated population 2022
population_df = pd.read_csv('NST-EST2022-ALLDATA.csv')
population_df = population_df[['NAME','POPESTIMATE2022']]

# Create all merge df after merging all data frames.
all_merged_df = pd.merge(wiki_final,population_df,left_on="state",right_on="NAME",how="left")
# Replace "_" with a space and strip the " (U.S. state)" qualifier that some state names carry, so they match the state names in the other datasets.
all_merged_df['state'] = all_merged_df['state'].str.replace("_"," ")
all_merged_df['state'] = all_merged_df['state'].str.replace(" (U.S. state)","",regex=False)

# Add Region and Division
region_division_df = pd.read_excel('US States by Region - US Census Bureau.xlsx',index_col=[0,1])
region_division_df = region_division_df.droplevel(0).dropna().reset_index()
region_division_df.columns=['regional_division','state']
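
The merges above join on state names, which only match when both sides spell the name identically. A minimal sketch, on made-up rows, of normalizing the key before the merge and using `indicator=True` to list the rows that found no partner. The helper name `normalize_state`, the population values, and the state "Atlantis" are all invented for illustration.

```python
import pandas as pd

def normalize_state(name):
    # Hypothetical helper: underscores to spaces, then drop the qualifier
    return name.replace("_", " ").replace(" (U.S. state)", "")

cities = pd.DataFrame({
    "state": ["New_Hampshire", "Georgia_(U.S._state)", "Vermont", "Atlantis"],
    "page_title": ["Town A", "Town B", "Town C", "Town D"],
})
population = pd.DataFrame({
    "NAME": ["New Hampshire", "Georgia", "Vermont"],
    "POPESTIMATE2022": [100, 200, 300],  # made-up numbers
})

# Normalize first so that multi-word names can match the population table
cities["state"] = cities["state"].map(normalize_state)
merged = cities.merge(population, left_on="state", right_on="NAME",
                      how="left", indicator=True)

# Rows marked "left_only" received no population value
unmatched = merged.loc[merged["_merge"] == "left_only", "state"].tolist()
print(unmatched)
```

Running the same check on the real data would show which states, if any, end up without a population value.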

Merging all the datasets into a final dataframe. Creating the final dataframe for this assignment which will be used for Analysis

In [4]:
# Final Dataframe
all_merged_df_division = pd.merge(all_merged_df,region_division_df,on="state",how="inner")
all_merged_df_division.drop(columns=['page_title','NAME','url'],inplace=True)
all_merged_df_division = all_merged_df_division.rename(columns={'prediction':'article_quality','POPESTIMATE2022':'population','title':'article_title','lastrevid':'revision_id'})
wp_scored_city_articles_by_state = all_merged_df_division[['state','regional_division','population','article_title','revision_id','article_quality']]

Save the final dataframe as 'wp_scored_city_articles_by_state.csv'

In [5]:
wp_scored_city_articles_by_state.to_csv('wp_scored_city_articles_by_state.csv',index=False)

# Analysis

The analysis consists of calculating total-articles-per-population (a ratio representing the number of articles per person) and high-quality-articles-per-population (a ratio representing the number of high-quality articles per person) on a state-by-state and divisional basis. All of these values are “per capita” ratios.

Group the data by state, regional_division, and population and count the total articles. Then calculate articles per capita.

In [7]:
final_df = wp_scored_city_articles_by_state
state_df = final_df.groupby(['state','regional_division','population']).agg(total_articles=('article_title','count')).reset_index()
state_df['tot_articles_per_capita'] = state_df['total_articles']/state_df['population']

## Top 10 US states by coverage: The 10 US states with the highest total articles per capita (in descending order)

In [8]:
top10states_by_coverage = state_df.sort_values('tot_articles_per_capita',ascending=False).head(10).reset_index(drop=True)
top10states_by_coverage

| | state | regional_division | population | total_articles | tot_articles_per_capita |
|---|---|---|---|---|---|
| 0 | Vermont | New England | 647064.0 | 329 | 0.000508 |
| 1 | Maine | New England | 1385340.0 | 483 | 0.000349 |
| 2 | Iowa | West North Central | 3200517.0 | 1041 | 0.000325 |
| 3 | Alaska | Pacific | 733583.0 | 149 | 0.000203 |
| 4 | Pennsylvania | Middle Atlantic | 12972008.0 | 2549 | 0.000197 |
| 5 | Alabama | East South Central | 5074296.0 | 918 | 0.000181 |
| 6 | Michigan | East North Central | 10034113.0 | 1767 | 0.000176 |
| 7 | Wyoming | Mountain | 581381.0 | 99 | 0.00017 |
| 8 | Arkansas | West South Central | 3045637.0 | 499 | 0.000164 |
| 9 | Missouri | West North Central | 6177957.0 | 948 | 0.000153 |


## Bottom 10 US states by coverage: The 10 US states with the lowest total articles per capita (in ascending order).

In [9]:
bottom10states_by_coverage = state_df.sort_values('tot_articles_per_capita').head(10).reset_index(drop=True)
bottom10states_by_coverage

| | state | regional_division | population | total_articles | tot_articles_per_capita |
|---|---|---|---|---|---|
| 0 | Nevada | Mountain | 3177772.0 | 19 | 6e-06 |
| 1 | California | Pacific | 39029342.0 | 475 | 1.2e-05 |
| 2 | Arizona | Mountain | 7359197.0 | 91 | 1.2e-05 |
| 3 | Oklahoma | West South Central | 4019800.0 | 74 | 1.8e-05 |
| 4 | Florida | South Atlantic | 22244823.0 | 410 | 1.8e-05 |
| 5 | Kansas | West North Central | 2937150.0 | 61 | 2.1e-05 |
| 6 | Maryland | South Atlantic | 6164660.0 | 156 | 2.5e-05 |
| 7 | Virginia | South Atlantic | 8683619.0 | 264 | 3e-05 |
| 8 | Wisconsin | East North Central | 5892539.0 | 191 | 3.2e-05 |
| 9 | Washington | Pacific | 7785786.0 | 280 | 3.6e-05 |


Filter the data to high-quality articles, i.e. those where article_quality is "FA" or "GA". Then group the filtered data by state, regional_division, and population to count the articles, and calculate articles per capita.

In [11]:
high_quality_list = ["FA","GA"]
high_quality_df = final_df[final_df['article_quality'].isin(high_quality_list)]
quality_df = high_quality_df.groupby(['state','regional_division','population']).agg(total_articles=('article_title','count')).reset_index()
quality_df['tot_articles_per_capita'] = quality_df['total_articles']/quality_df['population']
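
Because the grouping above runs on the filtered rows only, a state with no "FA" or "GA" article would not appear in `quality_df` at all. A minimal sketch, on made-up rows, of an alternative that flags high-quality rows and sums the flag, so such a state stays in the result with a count of zero. The column name `is_high_quality` and the state names are invented for illustration.

```python
import pandas as pd

# Made-up article rows: State C has no high-quality article
articles = pd.DataFrame({
    "state": ["State A", "State A", "State B", "State B", "State C"],
    "population": [1000, 1000, 2000, 2000, 500],
    "article_title": ["a1", "a2", "b1", "b2", "c1"],
    "article_quality": ["FA", "Stub", "GA", "GA", "C"],
})

# Flag rows instead of dropping them, then sum the boolean flag per state
articles["is_high_quality"] = articles["article_quality"].isin(["FA", "GA"])
per_state = (
    articles.groupby(["state", "population"])
    .agg(total_articles=("article_title", "count"),
         high_quality_articles=("is_high_quality", "sum"))
    .reset_index()
)
per_state["high_quality_per_capita"] = (
    per_state["high_quality_articles"] / per_state["population"]
)
print(per_state)
```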

## Top 10 US states by high quality: The 10 US states with the highest high quality articles per capita (in descending order) 

In [12]:
top10states_by_quality = quality_df.sort_values('tot_articles_per_capita',ascending=False).head(10).reset_index(drop=True)
top10states_by_quality

| | state | regional_division | population | total_articles | tot_articles_per_capita |
|---|---|---|---|---|---|
| 0 | Vermont | New England | 647064.0 | 45 | 7e-05 |
| 1 | Wyoming | Mountain | 581381.0 | 39 | 6.7e-05 |
| 2 | Montana | Mountain | 1122867.0 | 55 | 4.9e-05 |
| 3 | Pennsylvania | Middle Atlantic | 12972008.0 | 563 | 4.3e-05 |
| 4 | Missouri | West North Central | 6177957.0 | 262 | 4.2e-05 |
| 5 | Alaska | Pacific | 733583.0 | 31 | 4.2e-05 |
| 6 | Iowa | West North Central | 3200517.0 | 104 | 3.2e-05 |
| 7 | Oregon | Pacific | 4240137.0 | 134 | 3.2e-05 |
| 8 | Maine | New England | 1385340.0 | 43 | 3.1e-05 |
| 9 | Minnesota | West North Central | 5717184.0 | 167 | 2.9e-05 |


## Bottom 10 US states by high quality: The 10 US states with the lowest high quality articles per capita (in ascending order).

In [13]:
bottom10states_by_quality = quality_df.sort_values('tot_articles_per_capita').head(10).reset_index(drop=True)
bottom10states_by_quality

| | state | regional_division | population | total_articles | tot_articles_per_capita |
|---|---|---|---|---|---|
| 0 | Nevada | Mountain | 3177772.0 | 8 | 3e-06 |
| 1 | Arizona | Mountain | 7359197.0 | 24 | 3e-06 |
| 2 | Virginia | South Atlantic | 8683619.0 | 36 | 4e-06 |
| 3 | California | Pacific | 39029342.0 | 171 | 4e-06 |
| 4 | Florida | South Atlantic | 22244823.0 | 119 | 5e-06 |
| 5 | Kansas | West North Central | 2937150.0 | 20 | 7e-06 |
| 6 | Maryland | South Atlantic | 6164660.0 | 42 | 7e-06 |
| 7 | Oklahoma | West South Central | 4019800.0 | 31 | 8e-06 |
| 8 | Massachusetts | New England | 6981974.0 | 61 | 9e-06 |
| 9 | Louisiana | West South Central | 4590241.0 | 44 | 1e-05 |


Calculate total articles per capita by census division, first for all articles (coverage) and then for high-quality articles only.

## Census divisions by total coverage: A rank ordered list of US census divisions (in descending order) by total articles per capita.

In [14]:
division_df = final_df.groupby('regional_division').agg(total_articles=('article_title','count'),population=('population','sum')).reset_index()
division_df['tot_articles_per_capita'] = division_df['total_articles']/division_df['population']
census_divisions_by_total_coverage = division_df.sort_values('tot_articles_per_capita',ascending=False).reset_index(drop=True)
census_divisions_by_total_coverage

| | regional_division | total_articles | population | tot_articles_per_capita |
|---|---|---|---|---|
| 0 | New England | 1427 | 3339658000.0 | 4.272892e-07 |
| 1 | Mountain | 1208 | 4005255000.0 | 3.016038e-07 |
| 2 | West North Central | 3566 | 14280890000.0 | 2.497043e-07 |
| 3 | East South Central | 1984 | 9921118000.0 | 1.999775e-07 |
| 4 | South Atlantic | 1933 | 12522850000.0 | 1.543578e-07 |
| 5 | Middle Atlantic | 3773 | 33156450000.0 | 1.137938e-07 |
| 6 | East North Central | 4726 | 50006000000.0 | 9.450867e-08 |
| 7 | Pacific | 1284 | 22348600000.0 | 5.745328e-08 |
| 8 | West South Central | 2094 | 40066020000.0 | 5.226374e-08 |
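
In the division grouping, `population` is summed over article rows, so a state's population enters the division total once per article. A minimal sketch, on made-up rows, of an alternative that counts each state's population once before dividing. Whether this matches the intended definition of divisional per capita is an assumption, and the state and division names are invented.

```python
import pandas as pd

# Made-up rows: State A has three articles, State B has one
articles = pd.DataFrame({
    "state": ["State A", "State A", "State A", "State B"],
    "regional_division": ["North", "North", "North", "North"],
    "population": [1000, 1000, 1000, 3000],
    "article_title": ["a1", "a2", "a3", "b1"],
})

# Summing over article rows counts State A's population three times
row_sum = articles.groupby("regional_division")["population"].sum()

# Keep one row per state first, then sum populations per division
state_pop = (
    articles.drop_duplicates("state")
    .groupby("regional_division")["population"].sum()
)
article_counts = articles.groupby("regional_division")["article_title"].count()
per_capita = article_counts / state_pop
print(row_sum["North"], state_pop["North"], per_capita["North"])
```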


## Census divisions by high quality coverage: Rank ordered list of US census divisions (in descending order) by high quality articles per capita.

In [15]:
high_quality_list = ["FA","GA"]
division_high_quality_df = final_df[final_df['article_quality'].isin(high_quality_list)]
division_quality_df = division_high_quality_df.groupby('regional_division').agg(total_articles=('article_title','count'),population=('population','sum')).reset_index()
division_quality_df['tot_articles_per_capita'] = division_quality_df['total_articles']/division_quality_df['population']
census_divisions_by_high_quality = division_quality_df.sort_values('tot_articles_per_capita',ascending=False).reset_index(drop=True)
census_divisions_by_high_quality

| | regional_division | total_articles | population | tot_articles_per_capita |
|---|---|---|---|---|
| 0 | New England | 224 | 514587900.0 | 4.352998e-07 |
| 1 | Mountain | 341 | 1051078000.0 | 3.24429e-07 |
| 2 | West North Central | 635 | 2964991000.0 | 2.141659e-07 |
| 3 | East South Central | 369 | 2038105000.0 | 1.810506e-07 |
| 4 | South Atlantic | 514 | 3244120000.0 | 1.584405e-07 |
| 5 | Middle Atlantic | 1047 | 7303241000.0 | 1.43361e-07 |
| 6 | East North Central | 714 | 7360117000.0 | 9.700932e-08 |
| 7 | Pacific | 481 | 8203508000.0 | 5.863345e-08 |
| 8 | West South Central | 631 | 15080180000.0 | 4.184299e-08 |


#### Note: I have removed all the intermediate data outputs based on the comments on HW 1. I hope this makes the notebook less messy.