# Step 3: Combining the Datasets

Let's import our packages

In [1]:
# imports
import pandas as pd
import numpy as np

**Begin by merging the data we got from the APIs with the population data for each state**

First, we import the files

In [2]:
# df_quality = pd.read_csv('Article_data_with_RevIDs_Scores.csv')
df_quality = pd.read_csv('Article_data_with_RevIDs_Scores_Full_Valid.csv')
df_quality

Unnamed: 0,revision_id,page_title,state,article_quality,url
0,104730,"Abbeville, Alabama",Alabama,Stub,"https://en.wikipedia.org/wiki/Abbeville,_Alabama"
1,104761,"Adamsville, Alabama",Alabama,Stub,"https://en.wikipedia.org/wiki/Adamsville,_Alabama"
2,105188,"Addison, Alabama",Alabama,Stub,"https://en.wikipedia.org/wiki/Addison,_Alabama"
3,104726,"Akron, Alabama",Alabama,Stub,"https://en.wikipedia.org/wiki/Akron,_Alabama"
4,105109,"Alabaster, Alabama",Alabama,Stub,"https://en.wikipedia.org/wiki/Alabaster,_Alabama"
...,...,...,...,...,...
20152,140221,"Wamsutter, Wyoming",Wyoming,Start,"https://en.wikipedia.org/wiki/Wamsutter,_Wyoming"
20153,140185,"Wheatland, Wyoming",Wyoming,Stub,"https://en.wikipedia.org/wiki/Wheatland,_Wyoming"
20154,140245,"Worland, Wyoming",Wyoming,Stub,"https://en.wikipedia.org/wiki/Worland,_Wyoming"
20155,140070,"Wright, Wyoming",Wyoming,Start,"https://en.wikipedia.org/wiki/Wright,_Wyoming"


The NST population Excel file is messy and not laid out as a clean table. Hence, I read the column names from the fourth row of the sheet (`header=3`) and drop the first 5 rows of the data (because they are not states).

In [3]:
df_NST_pop = pd.read_excel('NST-EST2022-POP.xlsx', header=3)
# .copy() gives an independent frame, so the in-place edits below do not act on a slice
df_NST_pop = df_NST_pop.iloc[5:, :].copy()

# Source: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html
# Source: https://stackoverflow.com/questions/16167829/in-pandas-how-can-i-reset-index-without-adding-a-new-column
df_NST_pop.reset_index(drop=True, inplace=True)
df_NST_pop

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,2020,2021,2022
0,.Alabama,5024356.0,5031362.0,5049846.0,5074296.0
1,.Alaska,733378.0,732923.0,734182.0,733583.0
2,.Arizona,7151507.0,7179943.0,7264877.0,7359197.0
3,.Arkansas,3011555.0,3014195.0,3028122.0,3045637.0
4,.California,39538245.0,39501653.0,39142991.0,39029342.0
5,.Colorado,5773733.0,5784865.0,5811297.0,5839926.0
6,.Connecticut,3605942.0,3597362.0,3623355.0,3626205.0
7,.Delaware,989957.0,992114.0,1004807.0,1018396.0
8,.District of Columbia,689546.0,670868.0,668791.0,671803.0
9,.Florida,21538226.0,21589602.0,21828069.0,22244823.0


From the assignment instructions, we can infer that we only need the state column and the 2022 population data, so let's drop the other columns and rename the ones we keep. Also, notice that every state value begins with a '.', so let's strip that character. Lastly, let's drop the last 7 rows of the table, which are not states.

In [4]:
# Source: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html
df_NST_pop.drop(['Unnamed: 1', 2020, 2021], axis=1, inplace=True)

# Source: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html
# Source: https://www.geeksforgeeks.org/how-to-rename-columns-in-pandas-dataframe/
df_NST_pop.rename(columns={'Unnamed: 0': 'state', 2022 : 'population'}, inplace=True)

# .str[1:] applies the Python slice [1:] to every string in the column, dropping the leading '.'
# Source: https://stackoverflow.com/questions/42349572/remove-first-x-number-of-characters-from-each-row-in-a-column-of-a-python-datafr
df_NST_pop['state'] = df_NST_pop['state'].str[1:]

# Source: https://www.geeksforgeeks.org/remove-last-n-rows-of-a-pandas-dataframe/
df_NST_pop.drop(df_NST_pop.tail(7).index, inplace=True)

df_NST_pop

Unnamed: 0,state,population
0,Alabama,5074296.0
1,Alaska,733583.0
2,Arizona,7359197.0
3,Arkansas,3045637.0
4,California,39029342.0
5,Colorado,5839926.0
6,Connecticut,3626205.0
7,Delaware,1018396.0
8,District of Columbia,671803.0
9,Florida,22244823.0
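
The `.str[1:]` step above goes through pandas' string accessor, which applies an ordinary Python slice to every value in the column. Here is a minimal sketch on made-up values (the `names` Series is illustrative and not read from the Census file). `str.lstrip('.')` is shown as an alternative that only removes the character when it really is a dot.

```python
import pandas as pd

# Illustrative values only; the real column comes from NST-EST2022-POP.xlsx
names = pd.Series(['.Alabama', '.Alaska', 'Puerto Rico'])

# .str[1:] slices every string from index 1 onward, whatever the first character is
sliced = names.str[1:]

# .str.lstrip('.') strips leading dots only, so values without one are left intact
stripped = names.str.lstrip('.')

print(sliced.tolist())
print(stripped.tolist())
```

The slice is fine here because every state value starts with a dot. The `lstrip` form is the safer choice if some rows might not.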


Now we can merge the two tables and rename the 'page_title' column to 'article_title' to match the schema in the assignment instructions

In [5]:
# Source: https://towardsdatascience.com/left-join-with-pandas-data-frames-in-python-c29c85089ba4
# Source: https://stackoverflow.com/questions/53645882/pandas-merging-101
df_quality_pop = df_quality.merge(df_NST_pop, on='state', how='inner')

# Rearrange the column order
df_quality_pop = df_quality_pop[['revision_id', 'page_title', 'state', 'population', 'article_quality', 'url']]

# Rename the 'page_title' column
df_quality_pop.rename(columns = {'page_title': 'article_title'}, inplace=True)

df_quality_pop

Unnamed: 0,revision_id,article_title,state,population,article_quality,url
0,104730,"Abbeville, Alabama",Alabama,5074296.0,Stub,"https://en.wikipedia.org/wiki/Abbeville,_Alabama"
1,104761,"Adamsville, Alabama",Alabama,5074296.0,Stub,"https://en.wikipedia.org/wiki/Adamsville,_Alabama"
2,105188,"Addison, Alabama",Alabama,5074296.0,Stub,"https://en.wikipedia.org/wiki/Addison,_Alabama"
3,104726,"Akron, Alabama",Alabama,5074296.0,Stub,"https://en.wikipedia.org/wiki/Akron,_Alabama"
4,105109,"Alabaster, Alabama",Alabama,5074296.0,Stub,"https://en.wikipedia.org/wiki/Alabaster,_Alabama"
...,...,...,...,...,...,...
17030,140221,"Wamsutter, Wyoming",Wyoming,581381.0,Start,"https://en.wikipedia.org/wiki/Wamsutter,_Wyoming"
17031,140185,"Wheatland, Wyoming",Wyoming,581381.0,Stub,"https://en.wikipedia.org/wiki/Wheatland,_Wyoming"
17032,140245,"Worland, Wyoming",Wyoming,581381.0,Stub,"https://en.wikipedia.org/wiki/Worland,_Wyoming"
17033,140070,"Wright, Wyoming",Wyoming,581381.0,Start,"https://en.wikipedia.org/wiki/Wright,_Wyoming"


**Note:** Before the merge we had 20157 rows of data in our article quality table. However, after the merge we only have 17035. Why is that?

Because this is an inner merge, any article whose state value has no exact match in the population table is dropped. Let's identify all states in the table of data I scraped that are not found in the population table.

In [6]:
# Source: https://www.geeksforgeeks.org/check-if-element-exists-in-list-in-python/#

for state in df_quality.state.unique():
    if state not in df_NST_pop.state.unique():
        print("{} not found in NST population table".format(state))

Georgia_(U.S._state) not found in NST population table
New_Hampshire not found in NST population table
New_Jersey not found in NST population table
New_Mexico not found in NST population table
New_York not found in NST population table
North_Carolina not found in NST population table
North_Dakota not found in NST population table
Rhode_Island not found in NST population table
South_Carolina not found in NST population table
South_Dakota not found in NST population table
West_Virginia not found in NST population table
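
The same check can be done with the merge itself. The sketch below uses two toy frames standing in for `df_quality` and `df_NST_pop` (the `left` and `right` names and the population numbers are hypothetical). A left merge with `indicator=True` keeps every article row and labels the ones that found no partner.

```python
import pandas as pd

# Toy stand-ins for the article and population tables (hypothetical values)
left = pd.DataFrame({'state': ['Alabama', 'New_York', 'New_York', 'Georgia_(U.S._state)']})
right = pd.DataFrame({'state': ['Alabama', 'New York', 'Georgia'],
                      'population': [100.0, 200.0, 300.0]})

# how='left' keeps all left rows; the _merge column says whether a match was found
checked = left.merge(right, on='state', how='left', indicator=True)
unmatched_rows = checked[checked['_merge'] == 'left_only']

# Distinct state labels that an inner merge would silently drop
unmatched_states = sorted(unmatched_rows['state'].unique())
print(unmatched_states)
print(len(unmatched_rows), "rows would be lost in an inner merge")
```

This reports the number of affected rows as well as the state names, which makes it easy to reconcile against the row counts before and after the merge.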


By inspection, most of the discrepancies come from how multi-word state names are written: the article quality table uses underscores, while the population table uses spaces. For consistency, let's replace the spaces with underscores in the population table. Additionally, let's change the name for Georgia in the article quality table from 'Georgia_(U.S._state)' to 'Georgia'.

In [7]:
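# Source: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html
# Assign the result back instead of calling replace(..., inplace=True) on the column,
# which is not guaranteed to update the DataFrame in newer pandas versions
df_quality['state'] = df_quality['state'].replace('Georgia_(U.S._state)', 'Georgia')

# Source: https://stackoverflow.com/questions/42462530/how-to-replace-the-white-space-in-a-string-in-a-pandas-dataframe
# regex=True makes replace() match the space inside each string; without it, only cells equal to ' ' are replaced
df_NST_pop['state'] = df_NST_pop['state'].replace(' ', '_', regex=True)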
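
To see what the `regex` argument changes, here is a small sketch on a made-up Series (the values are illustrative). By default `Series.replace` compares whole cell values, whereas `regex=True` and the `.str.replace` accessor both work on substrings.

```python
import pandas as pd

s = pd.Series(['New York', 'Ohio', ' '])

# Default: only a cell whose entire value is ' ' gets replaced
whole_value = s.replace(' ', '_')

# regex=True: the pattern is searched for inside every string
substring = s.replace(' ', '_', regex=True)

# The string accessor does literal substring replacement directly
via_str = s.str.replace(' ', '_', regex=False)

print(whole_value.tolist())
print(substring.tolist())
print(via_str.tolist())
```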

Now, let's confirm that all states in the article quality table are found in the NST population table.

In [8]:
# Source: https://www.geeksforgeeks.org/check-if-element-exists-in-list-in-python/#

for state in df_quality.state.unique():
    if state not in df_NST_pop.state.unique():
        print("{} not found in NST population table".format(state))

Since nothing is printed, every state in the article quality table now has a match, so the state columns are ready to be merged. Once again we merge the two tables and get the following result, with the full 20157 rows.

In [9]:
# Source: https://towardsdatascience.com/left-join-with-pandas-data-frames-in-python-c29c85089ba4
# Source: https://stackoverflow.com/questions/53645882/pandas-merging-101
df_quality_pop = df_quality.merge(df_NST_pop, on='state', how='inner')

# Rearrange the column order
df_quality_pop = df_quality_pop[['revision_id', 'page_title', 'state', 'population', 'article_quality', 'url']]

# Rename the 'page_title' column
df_quality_pop.rename(columns = {'page_title': 'article_title'}, inplace=True)

df_quality_pop

Unnamed: 0,revision_id,article_title,state,population,article_quality,url
0,104730,"Abbeville, Alabama",Alabama,5074296.0,Stub,"https://en.wikipedia.org/wiki/Abbeville,_Alabama"
1,104761,"Adamsville, Alabama",Alabama,5074296.0,Stub,"https://en.wikipedia.org/wiki/Adamsville,_Alabama"
2,105188,"Addison, Alabama",Alabama,5074296.0,Stub,"https://en.wikipedia.org/wiki/Addison,_Alabama"
3,104726,"Akron, Alabama",Alabama,5074296.0,Stub,"https://en.wikipedia.org/wiki/Akron,_Alabama"
4,105109,"Alabaster, Alabama",Alabama,5074296.0,Stub,"https://en.wikipedia.org/wiki/Alabaster,_Alabama"
...,...,...,...,...,...,...
20152,140221,"Wamsutter, Wyoming",Wyoming,581381.0,Start,"https://en.wikipedia.org/wiki/Wamsutter,_Wyoming"
20153,140185,"Wheatland, Wyoming",Wyoming,581381.0,Stub,"https://en.wikipedia.org/wiki/Wheatland,_Wyoming"
20154,140245,"Worland, Wyoming",Wyoming,581381.0,Stub,"https://en.wikipedia.org/wiki/Worland,_Wyoming"
20155,140070,"Wright, Wyoming",Wyoming,581381.0,Start,"https://en.wikipedia.org/wiki/Wright,_Wyoming"
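
A row-count comparison like the one above can also be made explicit in code. The following is a sketch on toy frames (the `articles` and `population` names and their values are hypothetical). It uses the `validate` argument of `merge` to confirm that each state appears once in the population table, then asserts that no article rows were lost.

```python
import pandas as pd

# Toy stand-ins for the article and population tables (hypothetical values)
articles = pd.DataFrame({'page_title': ['A', 'B', 'C'],
                         'state': ['Alabama', 'Alabama', 'New_York']})
population = pd.DataFrame({'state': ['Alabama', 'New_York'],
                           'population': [100.0, 200.0]})

# validate='many_to_one' raises MergeError if a state is duplicated on the right side
merged = articles.merge(population, on='state', how='inner', validate='many_to_one')

# An inner merge drops unmatched rows silently, so compare the lengths explicitly
assert len(merged) == len(articles), "rows were dropped by the merge"
print(len(merged))
```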


**Now, let's merge in the geographic region corresponding to each state:**

First, let's import the geographical data. According to a classmate, the guidance was to focus on the 'DIVISION' column, so we can ignore the 'REGION' column. Let's also rename the columns to match our other tables.

In [10]:
df_region = pd.read_excel('US States by Region - US Census Bureau.xlsx')

# Source: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html
df_region.drop(['REGION'], axis=1, inplace=True)

# Source: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html
# Source: https://www.geeksforgeeks.org/how-to-rename-columns-in-pandas-dataframe/
df_region.rename(columns={'DIVISION': 'regional_division', 'STATE' : 'state'}, inplace=True)

df_region.head(10)

Unnamed: 0,regional_division,state
0,,
1,New England,
2,,Connecticut
3,,Maine
4,,Massachusetts
5,,New Hampshire
6,,Rhode Island
7,,Vermont
8,Middle Atlantic,
9,,New Jersey


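In the spreadsheet, each division name appears once on its own row, followed by the states that belong to it. We need to fill in the regional_division column (formerly DIVISION) so that every state row carries its division name, to make the merge easier.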

In [11]:
# Source: https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas
# Source: used this site to learn that .notna() is not a function for floats and what to do instead
# https://stackoverflow.com/questions/13413590/how-to-drop-rows-of-pandas-dataframe-whose-value-in-a-certain-column-is-nan
# Source: https://pandas.pydata.org/docs/reference/api/pandas.notna.html
current_division = np.nan

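for index, row in df_region.iterrows():
    if pd.notna(row.regional_division):
        # A division header row: remember its name for the state rows that follow
        current_division = row.regional_division
    else:
        # Write through df_region.at: iterrows() may yield copies, so assigning to `row` is not reliable
        df_region.at[index, 'regional_division'] = current_division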

df_region.head(10)

Unnamed: 0,regional_division,state
0,,
1,New England,
2,New England,Connecticut
3,New England,Maine
4,New England,Massachusetts
5,New England,New Hampshire
6,New England,Rhode Island
7,New England,Vermont
8,Middle Atlantic,
9,Middle Atlantic,New Jersey
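
The loop above carries the last seen division name down to the rows beneath it, which is a forward fill. As an alternative sketch, pandas' built-in `ffill()` does this in one call. The `region` frame below is a hand-typed miniature of the spreadsheet layout and is not read from the real file.

```python
import numpy as np
import pandas as pd

# Miniature of the layout: a division header row, then its states
region = pd.DataFrame({
    'regional_division': [np.nan, 'New England', np.nan, np.nan, 'Middle Atlantic', np.nan],
    'state': [np.nan, np.nan, 'Connecticut', 'Maine', np.nan, 'New Jersey'],
})

# Forward fill: each missing division takes the most recent non-missing value above it
region['regional_division'] = region['regional_division'].ffill()

# Header rows (no state) and the empty first row still contain NaN and are dropped
cleaned = region.dropna().reset_index(drop=True)
print(cleaned)
```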


Next, we can drop the null values

In [12]:
df_region.dropna(inplace=True)
df_region.head()

Unnamed: 0,regional_division,state
2,New England,Connecticut
3,New England,Maine
4,New England,Massachusetts
5,New England,New Hampshire
6,New England,Rhode Island


Finally, we can merge this dataset with the other dataframe we have and rearrange the columns to match the order they appear in the assignment schema

In [13]:
# Source: https://towardsdatascience.com/left-join-with-pandas-data-frames-in-python-c29c85089ba4
# Source: https://stackoverflow.com/questions/53645882/pandas-merging-101
df_quality_pop_division = df_quality_pop.merge(df_region, on='state', how='inner')

df_quality_pop_division = df_quality_pop_division[['state', 'regional_division', 'population', 'article_title', 
                                                   'revision_id', 'article_quality']]
df_quality_pop_division

Unnamed: 0,state,regional_division,population,article_title,revision_id,article_quality
0,Alabama,East South Central,5074296.0,"Abbeville, Alabama",104730,Stub
1,Alabama,East South Central,5074296.0,"Adamsville, Alabama",104761,Stub
2,Alabama,East South Central,5074296.0,"Addison, Alabama",105188,Stub
3,Alabama,East South Central,5074296.0,"Akron, Alabama",104726,Stub
4,Alabama,East South Central,5074296.0,"Alabaster, Alabama",105109,Stub
...,...,...,...,...,...,...
17495,Wyoming,Mountain,581381.0,"Wamsutter, Wyoming",140221,Start
17496,Wyoming,Mountain,581381.0,"Wheatland, Wyoming",140185,Stub
17497,Wyoming,Mountain,581381.0,"Worland, Wyoming",140245,Stub
17498,Wyoming,Mountain,581381.0,"Wright, Wyoming",140070,Start


Once again we see that instead of the 20157 rows, our merged data contains only 17500, suggesting there were issues with the merge. Let's once again check the discrepancies.

In [14]:
# Source: https://www.geeksforgeeks.org/check-if-element-exists-in-list-in-python/#

for state in df_quality_pop.state.unique():
    if state not in df_region.state.unique():
        print("{} not found in region table".format(state))

New_Hampshire not found in region table
New_Jersey not found in region table
New_Mexico not found in region table
New_York not found in region table
North_Carolina not found in region table
North_Dakota not found in region table
Rhode_Island not found in region table
South_Carolina not found in region table
South_Dakota not found in region table
West_Virginia not found in region table


Once again we see an inconsistency in how the two-word state names are represented: the region table uses spaces where our merged table uses underscores. Let's fix this by replacing the spaces with underscores in the region table, and then verify that this resolved the issue.

In [15]:
# Source: https://stackoverflow.com/questions/42462530/how-to-replace-the-white-space-in-a-string-in-a-pandas-dataframe
# regex=True makes replace() match the space inside each string rather than whole cell values
df_region['state'] = df_region['state'].replace(' ', '_', regex=True)

# Source: https://towardsdatascience.com/left-join-with-pandas-data-frames-in-python-c29c85089ba4
# Source: https://stackoverflow.com/questions/53645882/pandas-merging-101
df_quality_pop_division = df_quality_pop.merge(df_region, on='state', how='inner')

df_quality_pop_division = df_quality_pop_division[['state', 'regional_division', 'population', 'article_title', 
                                                   'revision_id', 'article_quality']]
df_quality_pop_division

Unnamed: 0,state,regional_division,population,article_title,revision_id,article_quality
0,Alabama,East South Central,5074296.0,"Abbeville, Alabama",104730,Stub
1,Alabama,East South Central,5074296.0,"Adamsville, Alabama",104761,Stub
2,Alabama,East South Central,5074296.0,"Addison, Alabama",105188,Stub
3,Alabama,East South Central,5074296.0,"Akron, Alabama",104726,Stub
4,Alabama,East South Central,5074296.0,"Alabaster, Alabama",105109,Stub
...,...,...,...,...,...,...
20152,Wyoming,Mountain,581381.0,"Wamsutter, Wyoming",140221,Start
20153,Wyoming,Mountain,581381.0,"Wheatland, Wyoming",140185,Stub
20154,Wyoming,Mountain,581381.0,"Worland, Wyoming",140245,Stub
20155,Wyoming,Mountain,581381.0,"Wright, Wyoming",140070,Start


We now have the expected set of results, implying our solution resolved the merge issue. Now, let's save the data to a csv as requested.

In [16]:
df_quality_pop_division.to_csv('wp_scored_city_articles_by_state.csv', index=False)

# Step 4: Analysis

For the analysis, we will compute the total articles per capita for the states and regional divisions in our data and then in step 5 we will rank the results.

In [17]:
# Source: https://www.geeksforgeeks.org/python-pandas-dataframe-groupby/
# Source: https://stackoverflow.com/questions/10373660/converting-a-pandas-groupby-output-from-series-to-dataframe
# https://stackoverflow.com/questions/32751229/pandas-sum-by-groupby-but-exclude-certain-columns

df_state_article_counts = df_quality_pop_division.groupby(['state', 'population'], as_index=False)[['article_title']].agg('count')
df_state_article_counts

Unnamed: 0,state,population,article_title
0,Alabama,5074296.0,443
1,Alaska,733583.0,142
2,Arizona,7359197.0,89
3,Arkansas,3045637.0,482
4,California,39029342.0,458
5,Colorado,5839926.0,258
6,Delaware,1018396.0,50
7,Florida,22244823.0,393
8,Georgia,10912876.0,465
9,Hawaii,1440196.0,130


Now, let's divide each state's article count by its population to get the articles per capita.

In [18]:
df_articles_per_capita_state = df_state_article_counts
df_articles_per_capita_state['articles_per_capita_state'] = df_state_article_counts.article_title / df_state_article_counts.population
df_articles_per_capita_state

Unnamed: 0,state,population,article_title,articles_per_capita_state
0,Alabama,5074296.0,443,8.7e-05
1,Alaska,733583.0,142,0.000194
2,Arizona,7359197.0,89,1.2e-05
3,Arkansas,3045637.0,482,0.000158
4,California,39029342.0,458,1.2e-05
5,Colorado,5839926.0,258,4.4e-05
6,Delaware,1018396.0,50,4.9e-05
7,Florida,22244823.0,393,1.8e-05
8,Georgia,10912876.0,465,4.3e-05
9,Hawaii,1440196.0,130,9e-05
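
The count and the division can also be chained into one expression. The sketch below assumes the same column names as the merged table. The `rows` frame is toy data: the article titles are placeholders, and the populations are copied from the tables above.

```python
import pandas as pd

# Toy rows in the shape of the merged table (titles are placeholders)
rows = pd.DataFrame({
    'state': ['Alaska', 'Alaska', 'Delaware'],
    'population': [733583.0, 733583.0, 1018396.0],
    'article_title': ['X, Alaska', 'Y, Alaska', 'Z, Delaware'],
})

per_state = (
    rows.groupby(['state', 'population'], as_index=False)
        # Named aggregation: count titles per state into a clearly named column
        .agg(article_count=('article_title', 'count'))
        # Per capita = articles divided by the state's population
        .assign(articles_per_capita=lambda d: d['article_count'] / d['population'])
)
print(per_state)
```

Naming the count column `article_count` avoids ending up with a column called `article_title` that actually holds a number.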


**Now we compute the total articles per capita for the regional divisions**

First we count the number of articles in each state, keeping each state's regional division alongside it.

In [19]:
# Source: https://www.geeksforgeeks.org/python-pandas-dataframe-groupby/
# Source: https://stackoverflow.com/questions/10373660/converting-a-pandas-groupby-output-from-series-to-dataframe
# https://stackoverflow.com/questions/32751229/pandas-sum-by-groupby-but-exclude-certain-columns

df_division_article_counts = df_quality_pop_division.groupby(['regional_division', 'state', 'population'], as_index=False)[['article_title']].agg('count')
df_division_article_counts

Unnamed: 0,regional_division,state,population,article_title
0,East North Central,Illinois,12582032.0,1226
1,East North Central,Indiana,6833037.0,551
2,East North Central,Michigan,10034113.0,1653
3,East North Central,Ohio,11756058.0,882
4,East North Central,Wisconsin,5892539.0,179
5,East South Central,Alabama,5074296.0,443
6,East South Central,Kentucky,4512310.0,393
7,East South Central,Mississippi,2940057.0,281
8,East South Central,Tennessee,7051339.0,327
9,Middle Atlantic,New_Jersey,9261699.0,533


Then we add the population and article counts in each state to find the total population and article counts for each regional division.

In [20]:
# Source: https://www.geeksforgeeks.org/python-pandas-dataframe-groupby/
# Source: https://stackoverflow.com/questions/10373660/converting-a-pandas-groupby-output-from-series-to-dataframe
# https://stackoverflow.com/questions/32751229/pandas-sum-by-groupby-but-exclude-certain-columns

df_division_article_counts = df_division_article_counts.groupby(['regional_division'], as_index=False)[['population', 'article_title']].agg('sum')
df_division_article_counts

Unnamed: 0,regional_division,population,article_title
0,East North Central,47097779.0,4491
1,East South Central,19578002.0,1444
2,Middle Atlantic,41910858.0,3538
3,Mountain,25514320.0,1096
4,New England,11503343.0,1314
5,Pacific,53229044.0,1217
6,South Atlantic,66781137.0,1714
7,West North Central,19721893.0,3382
8,West South Central,41685250.0,1961


Now we can compute the per capita values

In [21]:
df_articles_per_capita_region = df_division_article_counts
df_articles_per_capita_region['articles_per_capita_region'] = df_division_article_counts.article_title / df_division_article_counts.population

In [22]:
df_articles_per_capita_region

Unnamed: 0,regional_division,population,article_title,articles_per_capita_region
0,East North Central,47097779.0,4491,9.5e-05
1,East South Central,19578002.0,1444,7.4e-05
2,Middle Atlantic,41910858.0,3538,8.4e-05
3,Mountain,25514320.0,1096,4.3e-05
4,New England,11503343.0,1314,0.000114
5,Pacific,53229044.0,1217,2.3e-05
6,South Atlantic,66781137.0,1714,2.6e-05
7,West North Central,19721893.0,3382,0.000171
8,West South Central,41685250.0,1961,4.7e-05
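
Summing populations and article counts before dividing matters: the division's per capita value is a ratio of sums, which is not the same as averaging the state-level rates. A short check using the four East South Central rows from the tables above:

```python
import pandas as pd

# East South Central rows, copied from the state-level counts above
states = pd.DataFrame({
    'state': ['Alabama', 'Kentucky', 'Mississippi', 'Tennessee'],
    'population': [5074296.0, 4512310.0, 2940057.0, 7051339.0],
    'article_title': [443, 393, 281, 327],
})

# What the notebook computes: total articles over total population
ratio_of_sums = states['article_title'].sum() / states['population'].sum()

# A tempting shortcut that weights every state equally regardless of size
mean_of_ratios = (states['article_title'] / states['population']).mean()

print(ratio_of_sums, mean_of_ratios)
```

The two numbers differ because the unweighted mean gives small states the same influence as large ones.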


**Article Quality:**

To determine the number of high-quality articles per state and region population (per capita), we start by flagging whether or not each article is high-quality, which here means an article_quality of 'FA' or 'GA'.

In [23]:
# .copy() keeps the new flag column out of df_quality_pop_division
df_high_quality = df_quality_pop_division.copy()

# I used the following sources to learn that np.where() existed and how to use it
# https://stackoverflow.com/questions/21702342/creating-a-new-column-based-on-if-elif-else-condition
# https://stackoverflow.com/questions/19913659/how-do-i-create-a-new-column-where-the-values-are-selected-based-on-existing-col?noredirect=1&lq=1
# https://numpy.org/doc/stable/reference/generated/numpy.where.html
# https://stackoverflow.com/questions/36921951/truth-value-of-a-series-is-ambiguous-use-a-empty-a-bool-a-item-a-any-o
# https://stackoverflow.com/questions/16343752/numpy-where-function-multiple-conditions


df_high_quality['is_high_quality'] = np.where((df_high_quality['article_quality'] == 'FA') 
                                                      | (df_high_quality['article_quality'] == 'GA'), 1, 0)
df_high_quality[df_high_quality['is_high_quality'] == 1]

Unnamed: 0,state,regional_division,population,article_title,revision_id,article_quality,is_high_quality
38,Alabama,East South Central,5074296.0,"Berlin, Alabama",23831085,FA,1
1346,California,Pacific,39029342.0,"Irvine, California",5201333,FA,1
2756,Georgia,South Atlantic,10912876.0,"Warner Robins, Georgia",35849769,FA,1
2926,Idaho,Mountain,1939033.0,"Hayden, Idaho",31378210,FA,1
3980,Illinois,East North Central,12582032.0,"Plattville, Illinois",18407774,FA,1
4102,Illinois,East North Central,12582032.0,"Shabbona, Illinois",25987812,FA,1
4700,Indiana,East North Central,6833037.0,"Millersburg, Indiana",15868841,FA,1
6901,Maine,New England,1385340.0,"Rangeley, Maine",35981352,FA,1
6967,Maine,New England,1385340.0,"Unity, Maine",46954769,FA,1
7052,Maryland,South Atlantic,6164660.0,"Church Hill, Maryland",43434329,FA,1
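
An equivalent way to build the flag, shown as a sketch on a toy column, is `Series.isin`, which avoids chaining `==` comparisons with `|` when there are several qualifying labels.

```python
import pandas as pd

# Toy quality labels; 'FA' and 'GA' are the high-quality classes used above
articles = pd.DataFrame({'article_quality': ['Stub', 'FA', 'GA', 'Start', 'B']})

# isin() returns a boolean Series; astype(int) turns it into the 1/0 flag
articles['is_high_quality'] = articles['article_quality'].isin(['FA', 'GA']).astype(int)
print(articles)
```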


We can see the high-quality results in the output directly above. Now, we can group by the states and regions to find the per capita high-quality articles

In [24]:
df_high_quality_count = df_high_quality.groupby(['regional_division', 'state', 'population'], as_index=False)['is_high_quality'].agg('sum')
df_high_quality_count

Unnamed: 0,regional_division,state,population,is_high_quality
0,East North Central,Illinois,12582032.0,2
1,East North Central,Indiana,6833037.0,1
2,East North Central,Michigan,10034113.0,6
3,East North Central,Ohio,11756058.0,1
4,East North Central,Wisconsin,5892539.0,0
5,East South Central,Alabama,5074296.0,1
6,East South Central,Kentucky,4512310.0,0
7,East South Central,Mississippi,2940057.0,0
8,East South Central,Tennessee,7051339.0,0
9,Middle Atlantic,New_Jersey,9261699.0,0


Next, we find the high-quality articles per capita for states

In [25]:
df_high_quality_state = df_high_quality_count.drop('regional_division', axis=1)
df_high_quality_state['high_quality_per_capita'] = df_high_quality_state['is_high_quality'] / df_high_quality_state['population']
df_high_quality_state

Unnamed: 0,state,population,is_high_quality,high_quality_per_capita
0,Illinois,12582032.0,2,1.589568e-07
1,Indiana,6833037.0,1,1.463478e-07
2,Michigan,10034113.0,6,5.979602e-07
3,Ohio,11756058.0,1,8.506253e-08
4,Wisconsin,5892539.0,0,0.0
5,Alabama,5074296.0,1,1.970717e-07
6,Kentucky,4512310.0,0,0.0
7,Mississippi,2940057.0,0,0.0
8,Tennessee,7051339.0,0,0.0
9,New_Jersey,9261699.0,0,0.0


Lastly, we find the high-quality articles per capita for regions. We start by grouping the data by region

In [26]:
# Select multiple columns with a list (double brackets); a bare tuple of names is no longer accepted by pandas
df_high_quality_region = df_high_quality_count.groupby('regional_division', as_index=False)[['population', 'is_high_quality']].agg('sum')
df_high_quality_region

Unnamed: 0,regional_division,population,is_high_quality
0,East North Central,47097779.0,10
1,East South Central,19578002.0,1
2,Middle Atlantic,41910858.0,11
3,Mountain,25514320.0,1
4,New England,11503343.0,4
5,Pacific,53229044.0,1
6,South Atlantic,66781137.0,7
7,West North Central,19721893.0,0
8,West South Central,41685250.0,1


Now we compute the per capita values

In [27]:
df_high_quality_region['high_quality_per_capita'] = df_high_quality_region['is_high_quality'] / df_high_quality_region['population']
df_high_quality_region

Unnamed: 0,regional_division,population,is_high_quality,high_quality_per_capita
0,East North Central,47097779.0,10,2.123242e-07
1,East South Central,19578002.0,1,5.107774e-08
2,Middle Atlantic,41910858.0,11,2.624618e-07
3,Mountain,25514320.0,1,3.919368e-08
4,New England,11503343.0,4,3.47725e-07
5,Pacific,53229044.0,1,1.878674e-08
6,South Atlantic,66781137.0,7,1.0482e-07
7,West North Central,19721893.0,0,0.0
8,West South Central,41685250.0,1,2.39893e-08


# Step 5: Results

According to the assignment instructions, we are asked to produce 6 tables depicting the results of this analysis. They are listed in bold below and are direct quotes from the assignment instructions.

**1.	Top 10 US states by coverage: The 10 US states with the highest total articles per capita (in descending order)**.

In [28]:
# Source: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
table_1 = df_articles_per_capita_state.sort_values(by='articles_per_capita_state', axis=0, ascending=False, ignore_index=True)
table_1.head(10)

Unnamed: 0,state,population,article_title,articles_per_capita_state
0,Vermont,647064.0,302,0.000467
1,North_Dakota,779261.0,337,0.000432
2,South_Dakota,909824.0,296,0.000325
3,Iowa,3200517.0,999,0.000312
4,Maine,1385340.0,430,0.00031
5,Alaska,733583.0,142,0.000194
6,Pennsylvania,12972008.0,2389,0.000184
7,Wyoming,581381.0,96,0.000165
8,Michigan,10034113.0,1653,0.000165
9,Arkansas,3045637.0,482,0.000158


**2.	Bottom 10 US states by coverage: The 10 US states with the lowest total articles per capita (in ascending order)**.

In [29]:
# Source: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
table_2 = df_articles_per_capita_state.sort_values(by='articles_per_capita_state', axis=0, ascending=True, ignore_index=True)
table_2.head(10)

Unnamed: 0,state,population,article_title,articles_per_capita_state
0,North_Carolina,10698973.0,47,4e-06
1,Nevada,3177772.0,18,6e-06
2,California,39029342.0,458,1.2e-05
3,Arizona,7359197.0,89,1.2e-05
4,Virginia,8683619.0,129,1.5e-05
5,Florida,22244823.0,393,1.8e-05
6,Oklahoma,4019800.0,73,1.8e-05
7,Kansas,2937150.0,59,2e-05
8,Maryland,6164660.0,152,2.5e-05
9,Wisconsin,5892539.0,179,3e-05
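
The top and bottom tables can also be produced without sorting the whole frame, using `nlargest` and `nsmallest`. The sketch below uses four states whose populations and article counts are copied from the tables above. The `sample` frame and the `per_capita` column name are illustrative.

```python
import pandas as pd

# Four states with values copied from the results above
sample = pd.DataFrame({
    'state': ['Vermont', 'Iowa', 'Nevada', 'California'],
    'population': [647064.0, 3200517.0, 3177772.0, 39029342.0],
    'article_title': [302, 999, 18, 458],
})
sample['per_capita'] = sample['article_title'] / sample['population']

# nlargest returns rows in descending order, nsmallest in ascending order
top = sample.nlargest(2, 'per_capita')
bottom = sample.nsmallest(2, 'per_capita')

print(top['state'].tolist())
print(bottom['state'].tolist())
```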


**3.	Top 10 US states by high quality: The 10 US states with the highest high quality articles per capita (in descending order)**.

In [30]:
# Source: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
table_3 = df_high_quality_state.sort_values(by='high_quality_per_capita', axis=0, ascending=False, ignore_index=True)
table_3.head(10)

Unnamed: 0,state,population,is_high_quality,high_quality_per_capita
0,Vermont,647064.0,1,1.545442e-06
1,Maine,1385340.0,2,1.443689e-06
2,Pennsylvania,12972008.0,9,6.938016e-07
3,Michigan,10034113.0,6,5.979602e-07
4,West_Virginia,1775156.0,1,5.633308e-07
5,Idaho,1939033.0,1,5.15721e-07
6,Virginia,8683619.0,4,4.606374e-07
7,Alabama,5074296.0,1,1.970717e-07
8,Maryland,6164660.0,1,1.622149e-07
9,Illinois,12582032.0,2,1.589568e-07


**4.	Bottom 10 US states by high quality: The 10 US states with the lowest high quality articles per capita (in ascending order).**

Note: 31 states have no high-quality articles, so they are all tied at 0. Hence, while I display only 10 below to follow the instructions, the 10 shown are no worse than the other 21 states with no high-quality articles.

In [31]:
# Source: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
table_4 = df_high_quality_state.sort_values(by='high_quality_per_capita', axis=0, ascending=True, ignore_index=True)
table_4.head(10)

Unnamed: 0,state,population,is_high_quality,high_quality_per_capita
0,Rhode_Island,1093734.0,0,0.0
1,Washington,7785786.0,0,0.0
2,Oregon,4240137.0,0,0.0
3,Hawaii,1440196.0,0,0.0
4,Alaska,733583.0,0,0.0
5,North_Carolina,10698973.0,0,0.0
6,Oklahoma,4019800.0,0,0.0
7,New_Hampshire,1395231.0,0,0.0
8,South_Carolina,5282634.0,0,0.0
9,Iowa,3200517.0,0,0.0
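
When many rows tie at zero, which 10 appear in the table depends only on sort order. A quick way to report the size of the tie is to count the zero rows directly, as in this sketch on a toy frame (the `hq` name and its values are illustrative).

```python
import pandas as pd

# Toy counts of high-quality articles per state (illustrative values)
hq = pd.DataFrame({
    'state': ['A', 'B', 'C', 'D', 'E'],
    'is_high_quality': [0, 0, 1, 0, 2],
})

# Comparing to 0 gives a boolean Series; summing it counts the True values
n_zero = int((hq['is_high_quality'] == 0).sum())
print(n_zero, "states have no high-quality articles")
```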


**5.	Census divisions by total coverage: A rank ordered list of US census divisions (in descending order) by total articles per capita.**

In [32]:
# Source: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
table_5 = df_articles_per_capita_region.sort_values(by='articles_per_capita_region', axis=0, ascending=False, ignore_index=True)
table_5

Unnamed: 0,regional_division,population,article_title,articles_per_capita_region
0,West North Central,19721893.0,3382,0.000171
1,New England,11503343.0,1314,0.000114
2,East North Central,47097779.0,4491,9.5e-05
3,Middle Atlantic,41910858.0,3538,8.4e-05
4,East South Central,19578002.0,1444,7.4e-05
5,West South Central,41685250.0,1961,4.7e-05
6,Mountain,25514320.0,1096,4.3e-05
7,South Atlantic,66781137.0,1714,2.6e-05
8,Pacific,53229044.0,1217,2.3e-05


In [33]:
df_articles_per_capita_region.sort_values(by='population', axis=0, ascending=False, ignore_index=True)

Unnamed: 0,regional_division,population,article_title,articles_per_capita_region
0,South Atlantic,66781137.0,1714,2.6e-05
1,Pacific,53229044.0,1217,2.3e-05
2,East North Central,47097779.0,4491,9.5e-05
3,Middle Atlantic,41910858.0,3538,8.4e-05
4,West South Central,41685250.0,1961,4.7e-05
5,Mountain,25514320.0,1096,4.3e-05
6,West North Central,19721893.0,3382,0.000171
7,East South Central,19578002.0,1444,7.4e-05
8,New England,11503343.0,1314,0.000114


**6.	Census divisions by high quality coverage: Rank ordered list of US census divisions (in descending order) by high quality articles per capita.**

In [34]:
# Source: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
table_6 = df_high_quality_region.sort_values(by='high_quality_per_capita', axis=0, ascending=False, ignore_index=True)
table_6

Unnamed: 0,regional_division,population,is_high_quality,high_quality_per_capita
0,New England,11503343.0,4,3.47725e-07
1,Middle Atlantic,41910858.0,11,2.624618e-07
2,East North Central,47097779.0,10,2.123242e-07
3,South Atlantic,66781137.0,7,1.0482e-07
4,East South Central,19578002.0,1,5.107774e-08
5,Mountain,25514320.0,1,3.919368e-08
6,West South Central,41685250.0,1,2.39893e-08
7,Pacific,53229044.0,1,1.878674e-08
8,West North Central,19721893.0,0,0.0


# References:

- [1]. Homework 2 - Considering Bias in Data.docx