# Countries and Guns

My midterm project explores two sources: a Kaggle dataset with country-level data such as population and GDP, and the Wikipedia page that lists the estimated number of civilian guns per capita by country.

Since the right to bear arms is a long-running, controversial topic in United States policy, I am curious how gun ownership in the US compares with other countries, and whether guns per capita correlate with other variables such as GDP.


The world_countries.csv file was originally found on Kaggle (https://www.kaggle.com/datasets/fernandol/countries-of-the-world) and was created from US government data. It is a commonly used dataset because it is in the public domain.

The arms data was compiled by the Small Arms Survey, an independent research project based at a graduate institute in Geneva, Switzerland. It gathers information on small arms to help governments prevent armed violence.

In [763]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import altair as alt
# Website of countries and firearms 
page = requests.get('https://en.wikipedia.org/wiki/Estimated_number_of_civilian_guns_per_capita_by_country')
print(page.status_code)

200


In [764]:
arm_text = page.text  # raw HTML of the Wikipedia page
arm_text = BeautifulSoup(arm_text, 'html.parser')  # name the parser explicitly to avoid a warning

In [766]:
# All of country data
data = pd.read_csv('/Users/UnlawfulWaffle/CSV/world_countries.csv')

In [767]:
#arm_text #to check

In [768]:
# Back to website -- scraping data needed
table=arm_text.find('table',{'class':"wikitable"})
#table # to check
#type(table)

In [769]:
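from io import StringIO

# read_html returns a list of DataFrames, so keep the first one;
# the HTML is wrapped in StringIO because newer pandas versions deprecate raw HTML strings
table = pd.read_html(StringIO(str(table)))[0]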

In [770]:
# dropping columns that are not needed
table = table.drop(columns=['Notes', 'Computation method'])
table

Unnamed: 0,Location,Firearms per 100,Region,Subregion,Population 2017,Civilian firearms,Registered firearms,Unregistered firearms
0,United States,120.5,Americas,North America,326474000,393347000,1073743.0,"392,273,257 Est."
1,Falkland Islands,62.1,Americas,South America,3000,2000,1705.0,295
2,Yemen,52.8,Asia,Western Asia,28120000,14859000,,
3,New Caledonia,42.5,Oceania,Melanesia,270000,115000,55000.0,60000
4,Serbia,39.1,Europe,Southern Europe,6946000,2719000,1186086.0,1532914
...,...,...,...,...,...,...,...,...
225,Christmas Island,0.0,Asia,South-East Asia,2000,–,,
226,Holy See,0.0,Europe,Southern Europe,1000,–,,
227,Indonesia,0.0,Asia,South-East Asia,263510000,82000,41102.0,40898
228,Nauru,0.0,Oceania,Melanesia,10000,–,,


In [771]:
# rename column to match data's column
table = table.rename(columns={'Location': 'Country'})
table = table.drop(["Region", "Subregion", "Population 2017"], axis=1)  # drop columns the country dataset already provides

In [772]:
# cleaning data
#print(table.to_string()) #checking
# rename countries to match
table['Country'] = table['Country'].replace({
    'South Korea': 'Korea, South',
    'North Korea': 'Korea, North',
    'Congo, Dem. Rep.': 'DR Congo',
    'Bosnia and Herzegovina': 'Bosnia & Herzegovina',
    'United States': 'United States of America',
    'U.S. Virgin Islands': 'Virgin Islands',
    'Trinidad and Tobago': 'Trinidad & Tobago',
    'Turks and Caicos Islands': 'Turks & Caicos Islands',
})


In [773]:
#print(table.to_string()) #to check
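
The scraped table output above shows that some count columns hold text such as "392,273,257 Est." or a dash instead of plain numbers. Below is a minimal sketch of one way such columns could be coerced to numeric values. The helper name `to_number` and the toy frame are hypothetical and not part of this notebook; the sample values are copied from the output above.

```python
import pandas as pd

# Toy rows copied from the scraped table output (not the full data)
toy = pd.DataFrame({
    'Country': ['United States', 'Falkland Islands', 'Christmas Island'],
    'Civilian firearms': ['393347000', '2000', '–'],
    'Unregistered firearms': ['392,273,257 Est.', '295', None],
})

def to_number(col):
    """Hypothetical helper: strip everything except digits and dots, then convert.

    Values with nothing numeric left (a dash, a missing value) become NaN.
    Assumes the column holds non-negative counts, since minus signs are stripped too.
    """
    cleaned = col.astype(str).str.replace(r'[^0-9.]', '', regex=True)
    return pd.to_numeric(cleaned, errors='coerce')

for name in ['Civilian firearms', 'Unregistered firearms']:
    toy[name] = to_number(toy[name])

print(toy)
```

Converting these columns before merging would make them usable in correlations and charts, at the cost of dropping the "Est." marker.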

In [778]:
# merging the data
merged = pd.merge(table, data, on='Country', how='outer')  # outer join keeps every row from both tables; unmatched rows get NaN

In [785]:
merged = merged[merged['Firearms per 100'].notna()] #drop rows with 'Firearms' as NA
merged

Unnamed: 0,Country,Firearms per 100,Civilian firearms,Registered firearms,Unregistered firearms,Code,Region,Population,Area,Pop. Density,...,Phones,Arable,Crops,Other,Climate,Birthrate,Deathrate,Agriculture,Industry,Service
0,United States of America,120.5,393347000,1073743.0,"392,273,257 Est.",USA,NORTHERN AMERICA,298444215.0,9631420.0,31.0,...,898.0,19.13,0.22,80.65,3.0,14.14,8.26,0.010,0.204,0.787
1,Falkland Islands,62.1,2000,1705.0,295,,,,,,...,,,,,,,,,,
2,Yemen,52.8,14859000,,,YEM,NEAR EAST,21456188.0,527970.0,40.6,...,37.2,2.78,0.24,96.98,1.0,42.89,8.30,0.135,0.472,0.393
3,New Caledonia,42.5,115000,55000.0,60000,NCL,OCEANIA,219246.0,19060.0,11.5,...,252.2,0.38,0.33,99.29,2.0,18.11,5.69,0.150,0.088,0.762
4,Serbia,39.1,2719000,1186086.0,1532914,SRB,EASTERN EUROPE,9396411.0,88361.0,106.3,...,285.8,33.35,3.20,63.45,,,,0.166,0.255,0.579
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
225,Christmas Island,0.0,–,,,,,,,,...,,,,,,,,,,
226,Holy See,0.0,–,,,,,,,,...,,,,,,,,,,
227,Indonesia,0.0,82000,41102.0,40898,IDN,ASIA (EX. NEAR EAST),245452739.0,1919440.0,127.9,...,52.0,11.32,7.23,81.45,2.0,20.34,6.25,0.134,0.458,0.408
228,Nauru,0.0,–,,,NRU,OCEANIA,13287.0,21.0,632.7,...,143.0,0.00,0.00,100.00,2.0,24.76,6.70,,,
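
Assuming every scraped row has a `Firearms per 100` value, an outer merge followed by dropping rows without that value keeps the same rows as a left merge from the firearms table. The sketch below uses a left merge with `indicator=True` to list country names that found no match, such as Falkland Islands in the output above. The toy frames `firearms` and `countries` are hypothetical stand-ins for `table` and `data`.

```python
import pandas as pd

# Hypothetical stand-ins for the scraped table and the Kaggle data
firearms = pd.DataFrame({
    'Country': ['Yemen', 'Falkland Islands', 'Serbia'],
    'Firearms per 100': [52.8, 62.1, 39.1],
})
countries = pd.DataFrame({
    'Country': ['Yemen', 'Serbia', 'Nauru'],
    'Code': ['YEM', 'SRB', 'NRU'],
})

# indicator=True adds a '_merge' column saying where each row came from
checked = pd.merge(firearms, countries, on='Country', how='left', indicator=True)

# Rows marked 'left_only' exist in the firearms table but not in the country data
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Country'].tolist()
print(unmatched)
```

A list like this makes it easier to spot names that still need an entry in the renaming dictionary.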


# Statistical Analysis with Pandas

Searching to find correlations through statistics

For my first statistical analysis, I compare regional averages of several variables that might be correlated, although this is hard to interpret from raw numbers alone. This is exploratory data analysis with a wide net, to see if I can find anything interesting.

In [700]:
merged.pivot_table(columns='Region', values=['Literacy', 'Firearms per 100', 'GDP', 'Deathrate', 'Infant mortality', 'Pop. Density'])

Region,ASIA (EX. NEAR EAST),BALTICS,C.W. OF IND. STATES,EASTERN EUROPE,LATIN AMER. & CARIB,NEAR EAST,NORTHERN AFRICA,NORTHERN AMERICA,OCEANIA,SUB-SAHARAN AFRICA,WESTERN EUROPE
Deathrate,7.637143,12.63,10.341667,10.284545,6.374318,4.945,4.806,7.694,5.810526,15.16,9.354643
Firearms per 100,4.543478,9.7,4.825,14.054545,9.555,20.335714,5.08,45.525,7.072222,3.729545,18.504167
GDP,8053.571429,11300.0,4000.0,9808.333333,8620.454545,11850.0,5460.0,26100.0,8247.619048,2323.529412,27046.428571
Infant mortality,41.78,8.103333,44.41,12.686667,20.321364,23.677857,30.916,8.628,20.203684,80.039216,4.730357
Literacy,79.553571,99.733333,98.725,97.088889,90.513953,79.521429,67.24,97.75,88.835294,62.51,98.391304
Pop. Density,1264.825,39.833333,56.708333,100.9,134.047727,174.614286,38.933333,260.86,131.180952,92.264706,952.042857
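
`pivot_table` averages values by default, so each number above is a regional mean. As a rough sketch, the same means can be computed with `groupby`, along with a count showing how many countries sit behind each average. The toy frame below is hypothetical and reuses a few values from the merged output.

```python
import pandas as pd

# A few rows taken from the merged output above (toy subset only)
toy = pd.DataFrame({
    'Region': ['NORTHERN AMERICA', 'NEAR EAST', 'OCEANIA', 'OCEANIA', 'EASTERN EUROPE'],
    'Firearms per 100': [120.5, 52.8, 42.5, 0.0, 39.1],
})

# Same call shape as in the notebook: the default aggregation is the mean
pivot = toy.pivot_table(columns='Region', values=['Firearms per 100'])

# groupby gives the same means plus the number of countries per region
summary = toy.groupby('Region')['Firearms per 100'].agg(['mean', 'count'])
print(summary)
```

Regions with only a handful of countries can produce extreme averages, so the count column helps judge how much weight each mean deserves.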


Here I make a smaller table comparing death rate to firearms to see whether it is easier to interpret. I am also interested in climate as a variable.

Despite the smaller table, this is still hard to interpret, and climate does not seem to have a clear relationship with either variable.

In [666]:
merged.pivot_table(columns='Climate', values=['Deathrate', 'Firearms per 100'])

Climate,1.0,1.5,2.0,2.5,3.0,4.0
Deathrate,9.821786,9.9375,8.598426,15.0,9.779565,9.376667
Firearms per 100,12.17037,3.4625,5.673958,4.05,16.594872,15.766667


Now I test correlations, starting with Firearms per 100 and GDP as a small first case.
There is a moderate positive correlation between firearms and GDP (about 0.45), so next I create a larger correlation table to compare more variables.

In [820]:
corr_data = pd.DataFrame({'Firearms per 100': merged['Firearms per 100'], 'GDP': merged['GDP']})
print(corr_data.corr())

                  Firearms per 100       GDP
Firearms per 100          1.000000  0.446437
GDP                       0.446437  1.000000
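
`DataFrame.corr()` uses Pearson's r by default, which measures linear association and can be pulled around by a single extreme value. The sketch below computes r from its definition to show what the number means, then computes a rank-based version (Spearman-style, obtained by correlating ranks) as a comparison. The toy numbers are invented for illustration and are not from the dataset.

```python
import numpy as np
import pandas as pd

# Invented values, including one very large firearms figure
toy = pd.DataFrame({
    'Firearms per 100': [1.0, 5.0, 10.0, 20.0, 120.0],
    'GDP': [1000.0, 8000.0, 6000.0, 30000.0, 38000.0],
})

# Pearson's r: sum of products of deviations, scaled by both spreads
x = toy['Firearms per 100'] - toy['Firearms per 100'].mean()
y = toy['GDP'] - toy['GDP'].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# The same value from pandas
r_pandas = toy.corr().loc['Firearms per 100', 'GDP']

# Rank-based correlation: replace values by their ranks, then correlate
r_rank = toy.rank().corr().loc['Firearms per 100', 'GDP']

print(r_manual, r_pandas, r_rank)
```

If the rank-based value differs a lot from Pearson's r, that is a hint that outliers or a curved relationship are influencing the result.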


In the following cell I create a correlation matrix from the same columns as my first statistical analysis, along with a series that lists the correlations from highest to lowest. Among the pairs that include Firearms per 100, GDP has the highest positive correlation and Infant mortality has the most negative correlation.

Both the correlation series and the matrix have advantages: correlations are easier to interpret than raw numbers, and they allow quick comparisons between pairs of variables. A drawback of correlation matrices is that larger ones are hard to scan, especially with the diagonal of 1.0 values and every pair appearing twice. A limitation of correlation in general is that it does not prove causation.

In [732]:
corr_data2 = pd.DataFrame({'Literacy': merged['Literacy'], 'Firearms per 100': merged['Firearms per 100'], 'GDP': merged['GDP'],
                           'Deathrate': merged['Deathrate'], 'Infant mortality': merged['Infant mortality'],
                           'Pop. Density': merged['Pop. Density']})


corr_data2 = corr_data2.dropna()
# Calculate the correlation matrix
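corr_matrix = corr_data2.corr()
# stack() flattens the matrix into a Series of (row, column) pairs;
# the diagonal and both orderings of each pair are included
print(corr_matrix.stack().sort_values(ascending=False))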

Literacy          Literacy            1.000000
Firearms per 100  Firearms per 100    1.000000
Infant mortality  Infant mortality    1.000000
Deathrate         Deathrate           1.000000
GDP               GDP                 1.000000
Pop. Density      Pop. Density        1.000000
Infant mortality  Deathrate           0.668013
Deathrate         Infant mortality    0.668013
Literacy          GDP                 0.516280
GDP               Literacy            0.516280
                  Firearms per 100    0.464914
Firearms per 100  GDP                 0.464914
                  Literacy            0.240606
Literacy          Firearms per 100    0.240606
GDP               Pop. Density        0.216430
Pop. Density      GDP                 0.216430
                  Literacy            0.094558
Literacy          Pop. Density        0.094558
Firearms per 100  Pop. Density        0.012613
Pop. Density      Firearms per 100    0.012613
                  Deathrate          -0.025303
Deathrate    
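
The stacked series lists the 1.0 diagonal and shows every pair twice. One possible way to keep each pair once is to mask everything except the upper triangle of the matrix before stacking. This is a sketch on invented toy data, not the notebook's actual result.

```python
import numpy as np
import pandas as pd

# Invented values for illustration
toy = pd.DataFrame({
    'Firearms per 100': [1.0, 5.0, 10.0, 20.0, 120.0],
    'GDP': [1000.0, 8000.0, 6000.0, 30000.0, 38000.0],
    'Infant mortality': [80.0, 40.0, 45.0, 10.0, 6.0],
})

corr = toy.corr()

# True strictly above the diagonal, False on and below it
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Masked cells become NaN; dropna() removes them after stacking
pairs = corr.where(upper).stack().dropna().sort_values(ascending=False)
print(pairs)
```

With three variables this leaves three pairs instead of nine entries, which makes the strongest and weakest correlations easier to read off.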

# Graphical Analysis with Altair

Searching to find correlations through graphs

There is a slight positive correlation between literacy and firearms per 100, which shows up as a curve in the chart. Sub-Saharan Africa generally has low to middle firearms per 100, while other regions vary more widely. The United States is a big outlier. Since many countries have literacy rates near 100 but are spread all across the graph, literacy and firearms are not highly correlated.

In [833]:
alt.Chart(merged).mark_square(size = 200).encode(
    # both variables are numeric, so encode them as quantitative (:Q) rather than nominal (:N)
    x=alt.X('Firearms per 100:Q', scale=alt.Scale(zero=False)),
    y=alt.Y('Literacy:Q', scale=alt.Scale(zero=False)),
    color='Region:N',
    tooltip=['Firearms per 100:Q', 'Literacy:Q'],
).properties(
    width=1200, height=600)

Now I am curious to check Firearms and Infant mortality. The two show a negative trend. This chart makes the US stand out even more as an outlier.

In [829]:
alt.Chart(merged).mark_square().encode(
    x=alt.X('Firearms per 100:Q', scale=alt.Scale(zero=False)),
    y=alt.Y('Infant mortality:Q', scale=alt.Scale(zero=False)),
    color='Region:N',
    tooltip='Country:N',
    size = 'Firearms per 100'
).properties(
    width=1200, height=600
)
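
Since one extreme point can dominate a trend, a simple check is to compute the correlation with and without it. The sketch below does this on invented toy values; the country labels and numbers are hypothetical and only illustrate the comparison.

```python
import pandas as pd

# Invented values: four ordinary points and one extreme one
toy = pd.DataFrame({
    'Country': ['A', 'B', 'C', 'D', 'Outlier'],
    'Firearms per 100': [2.0, 5.0, 8.0, 12.0, 120.5],
    'Infant mortality': [60.0, 35.0, 40.0, 15.0, 6.5],
})

cols = ['Firearms per 100', 'Infant mortality']

# Correlation using every row
r_all = toy[cols].corr().iloc[0, 1]

# Correlation after leaving out the extreme row
r_trimmed = toy.loc[toy['Country'] != 'Outlier', cols].corr().iloc[0, 1]

print(round(r_all, 3), round(r_trimmed, 3))
```

If the two values have the same sign and similar size, the trend does not depend on the outlier alone.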

Most countries have a low firearms per 100 value, and every region is represented in the first bin of 0 to 20 firearms per 100.

In [645]:
alt.Chart(merged).mark_bar().encode(
    alt.X("Firearms per 100:Q", bin=True),
    y='count()',
    color=alt.Color('Region:N'),
    tooltip='Region:N'
)
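
The counts behind a binned bar chart can also be tabulated directly. This sketch assumes bins of width 20, matching the 0 to 20 first bin mentioned above, and uses `pd.cut` with `pd.crosstab` on a hypothetical toy subset of the merged data.

```python
import pandas as pd

# A few rows taken from the merged output above (toy subset only)
toy = pd.DataFrame({
    'Region': ['NORTHERN AMERICA', 'NEAR EAST', 'OCEANIA', 'OCEANIA', 'EASTERN EUROPE'],
    'Firearms per 100': [120.5, 52.8, 42.5, 0.0, 39.1],
})

# Bin edges 0, 20, ..., 140 (width of 20 is an assumption)
edges = list(range(0, 141, 20))
labels = [f'{lo}-{lo + 20}' for lo in edges[:-1]]

# right=False makes each bin include its lower edge, so 0.0 falls in '0-20'
toy['bin'] = pd.cut(toy['Firearms per 100'], bins=edges, labels=labels, right=False)

# Rows are bins, columns are regions, cells are country counts
counts = pd.crosstab(toy['bin'], toy['Region'])
print(counts)
```

A table like this shows exactly which regions appear in each bin, which is harder to read from stacked colors.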

The data shows that, of the variables tested, firearms per 100 is most correlated with GDP and infant mortality. With additional data, I would like to explore categorical variables such as language, religion, and economic development level.

I am also curious to look at statistics beyond guns to further interpret the data.