# Logistic Regression

The logistic regression model described in README.md is used to predict the quality of transit infrastructure from GDP per capita. In the data file, transit infrastructure is rated on a scale from 1 to 7, where 1 is the worst and 7 is the best. Four types of transportation are evaluated: air, rail, road, and port. For the purposes of logistic regression, I take the average of the four scores. If the average is above 4, I label the country as having good-quality transit infrastructure (1); if it is equal to or below 4, the transit quality is labeled poor (0).
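As a minimal sketch of this labeling rule, the snippet below averages the four scores and applies the threshold. The scores are copied from the 2013 rows shown in the output table later in this notebook; the short column names (`air`, `port`, `rail`, `road`), the `THRESHOLD` constant, and the `good_transit` name are illustrative assumptions, not names from the data file.

```python
import pandas as pd

# Scores taken from the 2013 rows printed later in this notebook.
# The short column names are illustrative, not the data file's headers.
scores = pd.DataFrame(
    {
        "air": [4.31, 5.56, 5.95],
        "port": [3.47, 4.97, 5.67],
        "rail": [1.18, 4.14, 4.89],
        "road": [3.85, 4.94, 5.68],
    },
    index=["Albania [ALB]", "Australia [AUS]", "United States [USA]"],
)

THRESHOLD = 4  # averages strictly above this are labeled good (1)

# Row-wise mean of the four transit scores, then the binary label
infra_avg = scores.mean(axis=1)
good_transit = (infra_avg > THRESHOLD).astype(int)

print(pd.DataFrame({"infra_avg": infra_avg, "good_transit": good_transit}))
```

With these values, Albania averages about 3.20 and is labeled 0, while Australia (about 4.90) and the United States (about 5.55) are labeled 1.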

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

### Cleaning Up the Data and Isolating Transit Quality and GDP

In [10]:
# read the CSV file with all country transit and socioeconomic data
df = pd.read_csv("CountryData.csv")

In [11]:
# clean up data file

# dependent variable: average of transit quality, 1 if > 4, 0 if <= 4
# isolate the following independent variables:
#		Quality of air transport infrastructure
#		Quality of port infrastructure
#		Quality of railroad infrastructure
#		Quality of roads
#		GDP per capita

row_ind = [12, 14, 16, 18, 22]
col_filled = ~df.loc[row_ind].isin([".."]).any() # keep only countries (columns) with no missing ".." values in these rows

fil_df = df.loc[row_ind, col_filled] # create new data table with isolated rows
fil_df.head()

| | Time | Time Code | Indicator Name | Indicator Code | Albania [ALB] | Algeria [DZA] | Angola [AGO] | Argentina [ARG] | Armenia [ARM] | Australia [AUS] | ... | Turkey [TUR] | Uganda [UGA] | Ukraine [UKR] | United Kingdom [GBR] | United States [USA] | Uruguay [URY] | Venezuela, RB [VEN] | Vietnam [VNM] | Zambia [ZMB] | Zimbabwe [ZWE] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12 | 2013 | YR2013 | Quality of air transport infrastructure [value... | QA.AIR.TRANS.IN | 4.31 | 3.01 | 3.38 | 3.56 | 4.53 | 5.56 | ... | 5.53 | 3.58 | 3.84 | 5.61 | 5.95 | 4.25 | 2.99 | 4.04 | 3.54 | 3.32 |
| 14 | 2013 | YR2013 | Quality of port infrastructure [value: 1 = wor... | QA.PORT.TRANS.IN | 3.47 | 2.7 | 2.94 | 3.67 | 3.05 | 4.97 | ... | 4.34 | 3.41 | 3.71 | 5.68 | 5.67 | 4.7 | 2.53 | 3.68 | 3.49 | 4.09 |
| 16 | 2013 | YR2013 | Quality of railroad infrastructure [value: 1 =... | QA.RAIL.TRANS.IN | 1.18 | 2.33 | 1.67 | 1.7 | 2.6 | 4.14 | ... | 3.12 | 1.53 | 4.47 | 5.01 | 4.89 | 1.23 | 1.58 | 2.97 | 2.11 | 2.27 |
| 18 | 2013 | YR2013 | Quality of roads [value: 1 = worst to 7 = best] | QA.ROAD.TRANS.IN | 3.85 | 3.29 | 2.35 | 3.07 | 3.68 | 4.94 | ... | 4.86 | 3.04 | 2.14 | 5.31 | 5.68 | 3.49 | 2.65 | 3.08 | 3.37 | 3.28 |
| 22 | 2013 | YR2013 | GDP per capita (current US$) | NY.GDP.PCAP.CD | 4413.063383 | 5979.60139 | 5057.747878 | 12963.67577 | 3680.166922 | 68190.701 | ... | 12578.18786 | 818.2854274 | 4129.896973 | 43426.29814 | 53409.75078 | 18335.25948 | 12403.1467 | 2359.517365 | 1820.718548 | 1362.300668 |
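To make the column filter in the cell above easier to follow, here is a small sketch on a made-up frame. The `toy` frame, its country names, and its values are hypothetical; only the `".."` missing-value marker and the `isin(...).any()` pattern come from the notebook.

```python
import pandas as pd

# Hypothetical miniature of the wide layout: rows are indicators,
# columns are countries, and ".." marks a missing value.
toy = pd.DataFrame(
    {
        "Indicator Name": ["Quality of roads", "GDP per capita"],
        "Country A": ["3.85", "4413.06"],
        "Country B": ["..", "5979.60"],
        "Country C": ["4.94", ".."],
    }
)

rows = [0, 1]
# True for every column that contains no ".." in the selected rows
col_filled = ~toy.loc[rows].isin([".."]).any()
kept = toy.loc[rows, col_filled]

print(list(kept.columns))  # ['Indicator Name', 'Country A']
```

`any()` reduces down each column, so a single `".."` in any of the selected rows is enough to drop that country.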


In [12]:
# transpose the table so each row is a country, then drop the time/label rows
df_transposed = fil_df.T
df_transposed.columns = df_transposed.iloc[2]  # use the indicator names as column headers
df_corr = df_transposed.iloc[4:, :].copy()  # copy so later assignments do not write to a slice
# convert every column to numeric; cells that cannot be parsed become NaN instead of raising ValueError
df_corr = df_corr.apply(pd.to_numeric, errors='coerce')

# average the four transit quality scores for each country
df_corr['infra_avg'] = df_corr[[
    'Quality of air transport infrastructure [value: 1 = worst to 7 = best]',
    'Quality of port infrastructure [value: 1 = worst to 7 = best]',
    'Quality of railroad infrastructure [value: 1 = worst to 7 = best]',
    'Quality of roads [value: 1 = worst to 7 = best]'
]].mean(axis=1)
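The transpose-and-convert step can be illustrated on a small made-up frame. This is a sketch under assumptions: the `wide` frame, its values, and the `by_country` and `numeric` names are hypothetical. It shows one possible way to handle cells that are not valid numbers, using `pd.to_numeric` with `errors="coerce"`, which yields `NaN` rather than raising.

```python
import pandas as pd

# Hypothetical wide frame: one row per indicator, one column per country.
wide = pd.DataFrame(
    {
        "Indicator Name": ["Quality of roads", "GDP per capita"],
        "Country A": ["3.85", "4413.06"],
        "Country B": ["4.94", "not available"],
    }
)

# Transpose so each row is a country, then promote the indicator names to headers.
by_country = wide.T
by_country.columns = by_country.loc["Indicator Name"]
by_country = by_country.drop(index="Indicator Name")

# errors="coerce" replaces anything that cannot be parsed as a number with NaN.
numeric = by_country.apply(pd.to_numeric, errors="coerce")

print(numeric.dtypes)
print(numeric)
```

Coercing silently hides bad cells, so it may be worth checking `numeric.isna().sum()` afterwards to see how many values were lost.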