# 0. Introduction

You're currently using the Internet to read this, and you're probably thankful for that, but what about people who do not have regular access to food or medical care? Here we have a county-level dataset that tells us how many people living in poverty have Internet access at home in the United States. The original data can be found here: https://www.kaggle.com/mmattson/us-broadband-availability. We won't be using the `source_sets` folder.

We want to know which circumstance most prevents people from having Internet access at home, and which state is in the most precarious situation in that regard.

For that, we'll follow these three steps:

* We'll decide which attributes are the most useful and which one we'll process.
* We'll briefly describe the computational method applied, along with the chosen parameters.
* And we'll present our results.

First of all we'll need to import some libraries:
* Using *numpy* and *pandas* we'll process our data.
* With *scikit-learn* we'll train and test our models.
* With *matplotlib* we'll generate some graphics.

In [4]:
#first things first
import numpy as np
import pandas as pd
#import matplotlib.pyplot as plt
#import seaborn as sns

# 1. Data mining

First of all, we load the dataset:

In [8]:
#loading
pd.set_option('display.max_columns', None)
csv = pd.read_csv('../../data/broadband_access.csv')
print("We have", csv.shape[0], "entries.")
print(csv.head())

We have 3142 entries.
                 full_name   county    state state_abr  population  unemp  \
0  Autauga County, Alabama  Autauga  Alabama        AL     55869.0    2.7   
1  Baldwin County, Alabama  Baldwin  Alabama        AL    223234.0    2.7   
2  Barbour County, Alabama  Barbour  Alabama        AL     24686.0    3.8   
3     Bibb County, Alabama     Bibb  Alabama        AL     22394.0    3.1   
4   Blount County, Alabama   Blount  Alabama        AL     57826.0    2.7   

   health_ins  poverty  SNAP  no_comp  no_internet  home_broad  broad_num  \
0         7.1     15.4  12.7     13.0         20.9        78.9        0.0   
1        10.2     10.6   7.5     11.4         21.3        78.1        0.0   
2        11.2     28.9  27.4     23.9         38.9        60.4        4.0   
3         7.9     14.0  12.4     23.7         33.8        66.1        0.0   
4        11.0     14.4   9.5     21.3         30.6        68.5        0.0   

   broad_avail  broad_cost  population_bbn  price_bb

We don't need that many entries. As we said in the introduction, we only want information at the state level, so we'll aggregate the counties into one row per state, which will be easier to process. We'll create two dataframes since the columns need to be combined differently: the county populations must be summed, while the remaining attributes must be averaged. Then we'll join the two.

But do we need all attributes?
* `state_abr` is redundant.
* `county` and `full_name` will be removed since we don't need county-specific data, and `full_name` is just a combination of `county` and `state`.
* We'll remove the `home_broad` (home broadband) attribute, since it is redundant: we already have the `no_internet` (no internet at home) one.
* The columns with a `_bbn` or `_bb` suffix (`population_bbn` and `price_bb`) will be dropped since they repeat already-known data and include unnecessary details.
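
The redundancy claims in the list above can be checked with a quick sketch. The small `sample` frame below is re-typed by hand from the five rows printed by `csv.head()`, so it only covers Alabama. The `"<county> County, <state>"` naming pattern is an assumption that may not hold for every row of the full file.

```python
import pandas as pd

# Five rows re-typed from the csv.head() output above (illustration only).
sample = pd.DataFrame({
    "full_name": [
        "Autauga County, Alabama",
        "Baldwin County, Alabama",
        "Barbour County, Alabama",
        "Bibb County, Alabama",
        "Blount County, Alabama",
    ],
    "county": ["Autauga", "Baldwin", "Barbour", "Bibb", "Blount"],
    "state": ["Alabama"] * 5,
    "no_internet": [20.9, 21.3, 38.9, 33.8, 30.6],
    "home_broad": [78.9, 78.1, 60.4, 66.1, 68.5],
})

# Assumed pattern: full_name == "<county> County, <state>".
rebuilt = sample["county"] + " County, " + sample["state"]
name_is_redundant = bool((rebuilt == sample["full_name"]).all())

# If home_broad mirrors no_internet, the two should be strongly
# negatively correlated and add up to roughly 100 percent.
corr = sample["no_internet"].corr(sample["home_broad"])
total = sample["no_internet"] + sample["home_broad"]

print("full_name rebuilt from county and state:", name_is_redundant)
print("correlation:", round(corr, 3))
print("sum of both columns per row:", total.round(1).tolist())
```

On these five rows the two columns sum to a little under 100, which supports dropping one of them. Running the same check on the full `csv` dataframe would be the real test.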

In [20]:
#Here we'll have one dataframe with only info related to population per state.
df1 = csv[['state', 'population']].copy()
df1 = df1.groupby(['state']).sum()

#And another one with all the other useful info. We are only going to copy the attributes we want.
df2 = csv[['state', 'unemp', 'health_ins', 'poverty', 'SNAP', 'no_comp', 'no_internet', 'broad_num', 'broad_avail', 'broad_cost']].copy()
df2 = df2.groupby(['state']).mean()

#And now we join them on the shared `state` index (no column names overlap, so no suffixes are needed).
df = df1.join(df2)
print(df.head())

            population     unemp  health_ins    poverty       SNAP    no_comp  \
state                                                                           
Alabama      4903185.0  3.446269   10.488060  20.337313  17.507463  22.944776   
Alaska        731545.0  8.320690   19.934483  13.317241  15.875862  11.262069   
Arizona      7278717.0  6.446667   11.640000  20.000000  15.953333  16.613333   
Arkansas     3017804.0  4.184000    8.637333  19.833333  15.681333  20.741333   
California  39512223.0  5.098276    7.808621  15.020690  10.517241  10.837931   

            no_internet  broad_num  broad_avail  broad_cost  
state                                                        
Alabama       33.577612   3.409091    57.825758   88.327222  
Alaska        23.951724   5.344828    69.051724   66.427143  
Arizona       27.780000   3.333333    59.046667   59.987333  
Arkansas      35.129333   3.786667    60.314667   60.787733  
California    18.086207   4.982759    76.705172   70.025862 
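
As an alternative to building two dataframes and joining them, the same sum-versus-mean aggregation can be sketched with a single `groupby(...).agg(...)` call using pandas named aggregation. The `toy` frame and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy county-level table with made-up values (not from the dataset).
toy = pd.DataFrame({
    "state": ["A", "A", "B"],
    "population": [100.0, 300.0, 50.0],
    "no_internet": [20.0, 40.0, 10.0],
})

# One pass: sum the populations, average the percentage column.
per_state = toy.groupby("state").agg(
    population=("population", "sum"),
    no_internet=("no_internet", "mean"),
)
print(per_state)
```

Note that a plain mean gives every county the same weight regardless of its population, which is also what the two-dataframe version above does.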

The `no_internet` attribute (percent with no home internet) is the one that interests us the most, and it is the one we'll process.

# 2. Data analysis

TBC