# Info Material

Number of districts per state: https://ballotpedia.org/Population_represented_by_state_legislators

## Unzipping all files

import zipfile
import os
path = "drive/MyDrive/US Elections/individual_states"

for file in os.listdir(path):
  if file.endswith(".zip"):
    filepath = f"{path}/{file}"
    print(filepath)
    with zipfile.ZipFile(filepath,"r") as zip_ref:
      zip_ref.extractall(filepath[:-4])

zip_file_paths = []
for file in os.listdir(path):
  if file.endswith(".zip"):
    zip_file_paths.append(path+"/"+file)

for filepath in zip_file_paths:
  os.remove(filepath)

# Read in data

In [1]:
!pip install datatable

import datatable as dt
import pandas as pd

Collecting datatable
  Downloading datatable-1.0.0-cp37-cp37m-manylinux_2_12_x86_64.whl (96.9 MB)
Installing collected packages: datatable
Successfully installed datatable-1.0.0


In [2]:
%%time
data = dt.fread("drive/MyDrive/US Elections/dataverse_files/2016-precinct-state.csv").to_pandas()

KeyboardInterrupt: ignored

In [None]:
data.shape

In [None]:
pd.set_option("display.max_columns", None)
data.head()

In [None]:
state_house = data[data["office"] == "State House"]

In [None]:
# Delete all rows where votes == 0
state_house = state_house[state_house.votes > 0]

In [None]:
# Set party to "Independent" for the candidates "[Write-in]", "All Others" and ""
state_house.loc[state_house[state_house["candidate"].isin(["[Write-in]", "All Others", ""])].index, "party"] = "Independent"

In [None]:
# Delete candidates: Affidavit, Absentee/Military, Scattering, SCATTERING and Federal
state_house.drop(
    state_house[state_house["candidate"].isin(["Affidavit", "Absentee/Military", "Scattering", "SCATTERING", "Federal"])].index, 
    inplace=True)

In [None]:
state_house.head()

In [None]:
state_house.shape

In [None]:
cat_columns = state_house.select_dtypes(exclude=["number"]).columns
state_house[cat_columns] = state_house[cat_columns].applymap(lambda x: x.lower() if isinstance(x, str) else x)
state_house.head()

## Check district count

According to Ballotpedia, there are 4828 house districts, in which a total of 5411 representatives are elected. Not all of these seats are up for election at the same time. Although 2016 was a presidential election year and most seats are filled in that election, I expect the number in the data to be a little lower.

Ballotpedia link: https://ballotpedia.org/State_Legislative_Districts

In [None]:
state_house.precinct.nunique()

In [None]:
state_house.groupby("state")["district"].nunique().sum()

## Check for number of states

In [None]:
state_house.groupby("state")["precinct"].nunique().sum()

In [None]:
print(state_house.state.nunique())
print(state_house.state.sort_values().unique())

**Missing states:** Alabama, Louisiana, Maryland, Nebraska <br>
**States without election:** Louisiana, Mississippi, New Jersey, Virginia, Alabama, and Maryland
* Mississippi had a run-off election -> no winner in the actual election in early November (the parties are incorrect)
  * The winner, "Donnie Scoggin", is a Republican; the district was 89
* New Jersey also had run-off elections: robert karabinchak = democrat, camille ferraro clark = republican
* Virginia also had a special election; otherwise OK
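The list of missing states can also be derived with a set difference instead of reading it off the printed array. The following is a minimal sketch under stated assumptions: `expected_states` is a shortened, hypothetical reference list (a full version would name all 50 states), and `present_states` is a toy stand-in for `state_house.state.unique()`.

```python
# Hypothetical, shortened reference list; a full list would contain all 50 states.
expected_states = ["alabama", "alaska", "louisiana", "maryland", "nebraska", "wyoming"]

# Toy stand-in for state_house["state"].unique()
present_states = ["alaska", "wyoming"]

# States in the reference list that never appear in the data
missing_states = sorted(set(expected_states) - set(present_states))
print(missing_states)
```

Whether a missing state simply held no election or is absent from the source file still has to be checked by hand, as in the notes above.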

In [None]:
# Donnie Scoggin = republican
state_house.loc[state_house[state_house["candidate"] == "donnie scoggin"].index, "party"] = "republican"

# Robert Karabinchak = democrat
state_house.loc[state_house[state_house["candidate"] == "robert karabinchak"].index, "party"] = "democratic"

# Camille Ferraro Clark = republican
state_house.loc[state_house[state_house["candidate"] == "camille ferraro clark"].index, "party"] = "republican"

#### Fill in all missing districts

In [None]:
# Delaware NA districts drop (two instances with a total of 3 votes for "others")
state_house.drop(state_house[(state_house["district"] == "") & (state_house["candidate"] == "other")].index, inplace=True)

# Mississippi
state_house.loc[state_house[(state_house["district"] == "") & 
                            (state_house["candidate"].isin(["travis mac haynes", "donnie scoggin", "ron swindall"]))].index, "district"] = "89"

state_house.loc[state_house[(state_house["district"] == "") & 
                            (state_house["candidate"].isin(["john glen corley", "larry d davis", "greg holcomb", 
                                                            "ben winston", "daniel wise"]))].index, "district"] = "106"

In [None]:
# ebel (district=merrimack 5), keans (district=strafford 23), wall (district=strafford 6)

## Data Cleaning Steps

* Correct wrong party information to either democratic or republican
* Change local democratic parties to "democratic"
* Group all independent candidates under the same "party" -> independent
* Delete Conservative and Libertarian candidates as they win too few seats
* Check for candidates -> delete all blank ballots, under-/overvoted entries
* Delete all other parties than the three to predict
* Delete districts which elect more than one representative
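The relabelling and filtering steps in this list can be sketched on toy data as follows. The raw labels are taken from the lists used later in this notebook; the `party_map` dictionary and the `toy` frame are illustrative assumptions, not the notebook's actual cleaning code.

```python
import pandas as pd

# Toy rows; column names follow state_house, values are illustrative.
toy = pd.DataFrame({
    "candidate": ["a", "b", "c", "d", "e"],
    "party": ["democratic-farmer-labor", "rep", "unaffiliated", "libertarian", "democratic"],
})

# Assumed mapping from raw labels to the three target parties.
party_map = {
    "democratic-farmer-labor": "democratic",
    "rep": "republican",
    "unaffiliated": "independent",
}
toy["party"] = toy["party"].replace(party_map)

# Keep only the three parties to predict; everything else is dropped.
keep = ["democratic", "republican", "independent"]
toy = toy[toy["party"].isin(keep)].reset_index(drop=True)
print(toy)
```

A single dictionary keeps all label decisions in one place, which makes it easier to review them than a series of separate `.loc` assignments.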

In [None]:
# Delete candidate columns
state_house = state_house.iloc[:, :-15]

In [None]:
df_info = pd.DataFrame()
df_info["Data Type"] = state_house.dtypes
df_info["Missing Values"] = state_house.isna().sum()
df_info["No. Unique Values"] = state_house.nunique()
df_info

#### Delete all districts with more than one elected representative

arizona, new hampshire, north and south dakota, vermont

* delete arizona, north dakota, south dakota (except 26a/b, 28a/b)
* new hampshire and vermont: no districts --> add after creating new dataframe
  * https://ballotpedia.org/Vermont_House_of_Representatives_elections,_2016
  * https://ballotpedia.org/New_Hampshire_House_of_Representatives_elections,_2016
  * https://ballotpedia.org/Georgia_House_of_Representatives_elections,_2016
  * https://ballotpedia.org/Kentucky_House_of_Representatives_elections,_2016

In [None]:
state_house.drop(
    state_house[state_house["state"].isin(["arizona", "north dakota", "new hampshire", "vermont", "georgia", "kentucky", "arkansas"])].index,
    inplace=True)

state_house.drop(
    state_house[(state_house["state"] == "south dakota") & 
                (~state_house["district"].isin(["26a", "26b", "28a", "28b"]))].index,
    inplace=True)                 

### Clean Parties

#### Clean candidates
california, edward fuller = independent

colorado
* doug miracle, mary parker = independent

connecticut
* charles jackson, david g. lapointe = republican
* ty perry = ?

delaware, other = independent

idaho, greg pruett = republican

illinois
* brandi mcguire, tony m. mccombie = republican
* carol ammons, christine law, michael w. halpin, mike smiddy = democratic

indiana, sean eberhart = republican<br>
kansas, kelley, kasha = republican<br>

maine
* bradstreet = republican
* doore = democratic

michigan
* cottrell, robert = republican
* mcgrath, beth, simpson, danetta l.= independent
* stadler, chuck = democratic

new mexico
* antonio ""moe"" maestas, patricia ""patty"" a lundstrom, roberto ""bobby"" jesse gonzales = democratic
* lorenzo a larrañaga = republican

new york
* barbara baer = democratic
* evan mcmullin = independent
* joe errigo, joseph giglio, philip palmesano = republican

north carolina
* terri m. johnson = democratic
* write-in (miscellaneous) = independent

ohio
* andrea mahone, douglas p. crowl, keith hatton, luke t. brewer, m. deborah tunstall, napoleon a. bell ii, stephen r. spoonamore, timothy peter grady = independent
* jeff brown = democratic

tennessee
* amberlee brooks = republican
* andrew newman = ?
* john r. huff sr. = ?
* rhonda lynnese gallman = ?
* sibyl reagan = ?

texas
* owens, raymond = democratic

utah, gordon jones = ?

washington, david d. schirle = republican

In [None]:
def assign_party(df, state, candidate, party):
  df.loc[df[(df["state"] == state) & (df["candidate"].isin(candidate))].index, "party"] = party

In [None]:
assign_party(state_house, "california", ["edward fuller"], "independent")
assign_party(state_house, "colorado", ["doug miracle", "mary parker"], "independent")
assign_party(state_house, "connecticut", ["charles jackson", "david g. lapointe"], "republican")
assign_party(state_house, "delaware", ["other"], "independent")
assign_party(state_house, "idaho", ["greg pruett"], "republican")
assign_party(state_house, "illinois", ["brandi mcguire", "tony m. mccombie"], "republican")
assign_party(state_house, "illinois", ["carol ammons", "christine law", "michael w. halpin", "mike smiddy"], "democratic")
assign_party(state_house, "indiana", ["sean eberhart"], "republican")
assign_party(state_house, "kansas", ["kelley, kasha"], "republican")
assign_party(state_house, "maine", ["doore"], "democratic")
assign_party(state_house, "maine", ["bradstreet"], "republican")
assign_party(state_house, "michigan", ["cottrell, robert"], "republican")
assign_party(state_house, "michigan", ["stadler, chuck"], "democratic")
assign_party(state_house, "michigan", ["mcgrath, beth", "simpson, danetta l."], "independent")
assign_party(state_house, "new mexico", ['antonio ""moe"" maestas', 'patricia ""patty"" a lundstrom', 'roberto ""bobby"" jesse gonzales'], "democratic")
assign_party(state_house, "new mexico", ["lorenzo a larrañaga"], "republican")
assign_party(state_house, "new york", ["barbara baer"], "democratic")
assign_party(state_house, "new york", ["evan mcmullin"], "independent")
assign_party(state_house, "new york", ["joe errigo", "joseph giglio", "philip palmesano"], "republican")
assign_party(state_house, "north carolina", ["terri m. johnson"], "democratic")
assign_party(state_house, "north carolina", ["write-in (miscellaneous)"], "independent")
assign_party(state_house, "ohio", ["jeff brown"], "democratic")
assign_party(state_house, "ohio", ["andrea mahone", "douglas p. crowl", "keith hatton", "luke t. brewer", "m. deborah tunstall", "napoleon a. bell ii", 
                                   "stephen r. spoonamore", "timothy peter grady"], "independent")
assign_party(state_house, "tennessee", ["amberlee brooks"], "republican")
assign_party(state_house, "texas", ["owens, raymond"], "democratic")
assign_party(state_house, "washington", ["david d. schirle"], "republican")
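Because a mask that matches no rows turns a `.loc` assignment into a silent no-op, a misspelled state or candidate name goes unnoticed. The sketch below shows a hypothetical variant, `assign_party_checked`, that returns the number of matched rows so that a count of zero can be spotted. The function name and the toy data are assumptions made for illustration.

```python
import pandas as pd

def assign_party_checked(df, state, candidates, party):
    """Hypothetical variant of assign_party that reports how many rows matched."""
    mask = (df["state"] == state) & (df["candidate"].isin(candidates))
    df.loc[mask, "party"] = party
    return int(mask.sum())

# Toy rows; column names follow state_house.
toy = pd.DataFrame({
    "state": ["idaho", "idaho", "maine"],
    "candidate": ["greg pruett", "someone else", "doore"],
    "party": ["", "democratic", ""],
})

n_hit = assign_party_checked(toy, "idaho", ["greg pruett"], "republican")
# State name misspelled on purpose: nothing matches, so nothing changes.
n_miss = assign_party_checked(toy, "idahoo", ["greg pruett"], "republican")
print(n_hit, n_miss)
```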

#### Clean Parties

In [None]:
democratic = ["independent/democratic", "democrat, independent", "democrat, working families", 
              "democrat, independent, working families", "democrat, independent, republican", 
              "democrat, working families, independent", "democrat, republican, independent", 
              "democratic-farmer-labor", "dfw", "dcr", "republican, democrat"]
republican = ["republican, democrat, independent", "republican, independent, democrat", "republican, independent", 
              "republican, libertarian", "bottom line"]
independent = ["nav", "independent, republican", "independent, pacific green, progressive", "independent, libertarian", 
               "none", "nominated by petition", "unenrolled", "unaffiliated", "no affiliation", "kmi", "nop", "par", 
               "id"]
libertarian = ["libertarian, independent", "lib"]
pacific_green = ["pacific green, progressive"]

In [None]:
state_house.party.nunique()

In [None]:
state_house.loc[state_house[state_house["party"].isin(democratic)].index, "party"] = "democratic"
state_house.loc[state_house[state_house["party"].isin(republican)].index, "party"] = "republican"
state_house.loc[state_house[state_house["party"].isin(independent)].index, "party"] = "independent"
state_house.loc[state_house[state_house["party"].isin(libertarian)].index, "party"] = "libertarian"
state_house.loc[state_house[state_house["party"].isin(pacific_green)].index, "party"] = "pacific green"

In [None]:
state_house.party.nunique()

**party = democratic/republican**
* republican = ronald s marsico, matthew e baker, karen boback, jerome p knowles, james e marshall, harold a english, gary w day, carl walker metzgar
* democratic = mark a longietti, h scott conklin, christopher sainato, anita astorino kulik, ebel (district=merrimack 5), keans (district=strafford 23), wall (district=strafford 6)

sami al-abdrabbuh = democratic

**independence**
* republican = mike hurley, daby carreras, daniel stec, gary finch, heather tarrant, jonathan kostakopoulos, nicole malliotakis, peter lopez, rebecca harary, ronald castorina jr., steven mclaughlin
* democratic = carrie woerner, didi barrett, edward braunstein, erik martin dilan, michael cusick, peter abbate jr., steven cymbrowitz

In [None]:
democratic_republican = ["mark a longietti", "h scott conklin", "christopher sainato", "anita astorino kulik", 
                         "ebel", "keans", "wall"]
republican_democratic = ["ronald s marsico", "matthew e baker", "karen boback", "jerome p knowles", "james e marshall", 
                         "harold a english", "gary w day", "carl walker metzgar"]

republican_independence = ["mike hurley", "daby carreras", "daniel stec", "gary finch", "heather tarrant", 
                           "jonathan kostakopoulos", "nicole malliotakis", "peter lopez", "rebecca harary", 
                           "ronald castorina jr.", "steven mclaughlin"]
democratic_independence = ["carrie woerner", "didi barrett", "edward braunstein", "erik martin dilan", 
                           "michael cusick", "peter abbate jr.", "steven cymbrowitz"]

#sami al-abdrabbuh = democratic
state_house.loc[state_house.query("candidate == 'sami al-abdrabbuh'").index, "party"] = "democratic"

# ramiro valderrama == republican
state_house.loc[state_house.query("candidate == 'ramiro valderrama'").index, "party"] = "republican"

# Party democratic/republican to either democratic or republican
state_house.loc[state_house[(state_house["party"] == "democratic/republican") & 
                            (state_house["candidate"].isin(democratic_republican))].index, "party"] = "democratic"

state_house.loc[state_house[(state_house["party"] == "democratic/republican") & 
                            (state_house["candidate"].isin(republican_democratic))].index, "party"] = "republican"

# party independence to either democratic or republican
state_house.loc[state_house[(state_house["party"] == "independence") & 
                            (state_house["candidate"].isin(democratic_independence))].index, "party"] = "democratic"

state_house.loc[state_house[(state_house["party"] == "independence") & 
                            (state_house["candidate"].isin(republican_independence))].index, "party"] = "republican"

# "ebel" (district=merrimack 5), "keans" (district=strafford 23), "wall" (district=strafford 6)
state_house.loc[state_house.query("candidate == 'ebel'").index, "district"] = "merrimack 5"
state_house.loc[state_house.query("candidate == 'keans'").index, "district"] = "strafford 23"
state_house.loc[state_house.query("candidate == 'wall'").index, "district"] = "strafford 6"

# rep / tcn and dem / democrat to republican and democratic
state_house.loc[state_house.query("party == 'rep'").index, "party"] = "republican"
state_house.loc[state_house.query("party == 'tcn'").index, "party"] = "republican"
state_house.loc[state_house.query("party == 'dem'").index, "party"] = "democratic"
state_house.loc[state_house.query("party == 'democrat'").index, "party"] = "democratic"

# nonpartisan to independent
state_house.loc[state_house.query("party == 'nonpartisan'").index, "party"] = "independent"

In [None]:
state_house.party.value_counts()

### Delete all other parties than the three to predict (Dem, Rep, Ind)

In [None]:
# Inspect the remaining party labels before deleting all parties other than democratic, independent or republican
state_house["party"].unique()

In [None]:
state_house.shape

In [None]:
state_house.drop(
    state_house[~state_house["party"].isin(["democratic", "republican", "independent"])].index,
    inplace=True)

### State house winners per district

In [None]:
test_house_district = state_house.groupby(["state", "district", "candidate", "party"])["votes"].sum()
test_house_district

In [None]:
test_house_district["alaska"]["7"].idxmax()[1]

In [None]:
%%time
winners_per_district = []
for state_col in state_house["state"].sort_values().unique():
  for distcol in state_house[state_house["state"] == state_col]["district"].unique():
      winners_per_district.append(test_house_district[state_col][distcol].idxmax()[1])

In [None]:
from collections import Counter
Counter(winners_per_district)
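The nested loop above can also be expressed with a grouped `idxmax`. This is a minimal sketch on toy data with hypothetical candidates and vote counts; it assumes the same column names as `state_house` and is an alternative formulation, not the code that produced the results in this notebook.

```python
import pandas as pd

# Toy precinct-level rows (hypothetical values).
toy = pd.DataFrame({
    "state": ["alaska"] * 4 + ["wyoming"] * 2,
    "district": ["1", "1", "2", "2", "7", "7"],
    "candidate": ["cand a", "cand b", "cand c", "cand d", "cand e", "cand f"],
    "party": ["democratic", "independent", "democratic", "republican", "republican", "democratic"],
    "votes": [400, 50, 100, 300, 500, 200],
})

# Total votes per candidate within each district
totals = toy.groupby(["state", "district", "candidate", "party"], as_index=False)["votes"].sum()

# Row label of the top vote-getter within each (state, district)
winner_rows = totals.groupby(["state", "district"])["votes"].idxmax()
winners = totals.loc[winner_rows, ["state", "district", "party"]].reset_index(drop=True)
print(winners["party"].value_counts())
```

The result keeps state and district next to the winning party, so it can be used directly as the basis for a target table.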

### Save DataFrame

In [None]:
state_house.to_csv("drive/MyDrive/US Elections/state_house_2016.csv", index=False)

## Next Steps
* Create target variable -> create new dataframe

> In 2016, 44 states held state legislative elections; 86 of the 99 chambers were up for election. Only six states did not hold state legislative elections: Louisiana, Mississippi, New Jersey, Virginia, Alabama, and Maryland. [https://en.wikipedia.org/wiki/2016_United_States_elections#State_elections]
  * look for new jersey and virginia in state_house data
    * both are included. But in both states just one vacant seat is filled. -> Special election
  * look for indiana, Nebraska, Oregon data
    * Indiana -> must be handcoded
    * Nebraska -> no lower chamber - unicameral legislature (senate)
    * Oregon -> must be handcoded

#### Load Data

In [7]:
import pandas as pd
import numpy as np
state_house = pd.read_csv("drive/MyDrive/US Elections/state_house_2016.csv")



In [8]:
state_house.shape

(313464, 22)

In [9]:
# Delaware NA districts drop (two instances with a total of 3 votes for "others")
state_house.drop(state_house[(state_house["district"] == "") & (state_house["candidate"] == "other")].index, inplace=True)

# Mississippi
state_house.loc[state_house[(state_house["district"] == "") & 
                            (state_house["candidate"].isin(["travis mac haynes", "donnie scoggin", "ron swindall"]))].index, "district"] = "89"

state_house.loc[state_house[(state_house["district"] == "") & 
                            (state_house["candidate"].isin(["john glen corley", "larry d davis", "greg holcomb", 
                                                            "ben winston", "daniel wise"]))].index, "district"] = "106"

In [10]:
state_house.head()

Unnamed: 0,year,stage,special,state,state_postal,state_fips,state_icpsr,county_name,county_fips,county_ansi,...,jurisdiction,precinct,candidate,candidate_normalized,office,district,writein,party,mode,votes
0,2016,gen,False,alaska,ak,2,81,,,,...,district 1,01-446 aurora,[write-in],in,state house,1,True,independent,election day,74
1,2016,gen,False,alaska,ak,2,81,,,,...,district 1,01-446 aurora,"kawasaki, scott j.",kawasaki,state house,1,False,democratic,election day,633
2,2016,gen,False,alaska,ak,2,81,,,,...,district 1,01-455 fairbanks no. 1,[write-in],in,state house,1,True,independent,election day,17
3,2016,gen,False,alaska,ak,2,81,,,,...,district 1,01-455 fairbanks no. 1,"kawasaki, scott j.",kawasaki,state house,1,False,democratic,election day,142
4,2016,gen,False,alaska,ak,2,81,,,,...,district 1,01-465 fairbanks no. 2,[write-in],in,state house,1,True,independent,election day,25


#### Check for unopposed elections

In [11]:
test_house_district = state_house.groupby(["state", "district", "candidate", "party"])["votes"].sum()

In [12]:
unopposed_elections = []
for state in state_house["state"].sort_values().unique():
  for district in state_house[state_house["state"] == state]["district"].unique():
    if test_house_district[state][district].nunique() < 2:
      unopposed_elections.append([state, district])

In [13]:
print(f"There are {len(unopposed_elections)} unopposed elections.")

There are 776 unopposed elections.
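An alternative way to count unopposed races is to count distinct candidates per district directly. The sketch below uses toy data with hypothetical values. Note the assumption: it treats a district as unopposed when it has fewer than two distinct candidates, whereas the loop above counts distinct vote totals, so the two can differ when candidates tie. Write-in rows also count as candidates here.

```python
import pandas as pd

# Toy rows (hypothetical values, same column names as state_house).
toy = pd.DataFrame({
    "state": ["alaska", "alaska", "alaska", "wyoming"],
    "district": ["1", "1", "2", "7"],
    "candidate": ["cand a", "cand b", "cand c", "cand d"],
    "party": ["democratic", "republican", "republican", "republican"],
    "votes": [400, 350, 300, 500],
})

# Number of distinct candidates in each (state, district)
n_candidates = toy.groupby(["state", "district"])["candidate"].nunique()

# Districts with fewer than two candidates
unopposed = n_candidates[n_candidates < 2].reset_index()[["state", "district"]]
print(f"There are {len(unopposed)} unopposed elections.")
```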


In [14]:
state_house["state"].sort_values().unique()

array(['alaska', 'california', 'colorado', 'connecticut', 'delaware',
       'florida', 'hawaii', 'idaho', 'illinois', 'indiana', 'iowa',
       'kansas', 'maine', 'massachusetts', 'michigan', 'minnesota',
       'mississippi', 'missouri', 'montana', 'nevada', 'new jersey',
       'new mexico', 'new york', 'north carolina', 'ohio', 'oklahoma',
       'oregon', 'pennsylvania', 'rhode island', 'south carolina',
       'south dakota', 'tennessee', 'texas', 'utah', 'virginia',
       'washington', 'west virginia', 'wisconsin', 'wyoming'],
      dtype=object)

## Create new DataFrame

In [15]:
test_house_district

state    district  candidate             party      
alaska   1         [write-in]            independent     476
                   kawasaki, scott j.    democratic     4376
         2         [write-in]            independent      42
                   holdaway, truno n. l  democratic     1153
                   thompson, steve m.    republican     3268
                                                        ... 
wyoming  7         sue wilson            republican     4782
         8         bob nicholas          republican     2570
                   linda burt            democratic     1941
         9         landon brown          republican     2299
                   mike weiland          democratic     1639
Name: votes, Length: 6534, dtype: int64

### Create target, district and state variable 

In [16]:
state_house[state_house["state"] == "arkansas"]["district"].unique()

array([], dtype=object)

In [17]:
target_list = []
district_list = []
state_list = []

for state in state_house["state"].sort_values().unique():
  # Makes list of all districts
  district_list.extend(state_house[state_house["state"] == state]["district"].unique().tolist())
  
  # Makes list of all states times number of districts
  multiplier = state_house[state_house["state"] == state]["district"].nunique()
  state_list.extend([state] * multiplier)
  
  # Makes list of target variable (-> whether republican, democratic or independent)
  for district in state_house[state_house["state"] == state]["district"].unique():
    target_list.append(test_house_district[state][district].idxmax()[1])

In [18]:
len(target_list), len(district_list), len(state_list)

(3245, 3245, 3245)

In [19]:
from collections import Counter
Counter(target_list)

Counter({'democratic': 1444, 'independent': 2, 'republican': 1799})

In [20]:
Counter(state_list)

Counter({'alaska': 40,
         'california': 80,
         'colorado': 65,
         'connecticut': 151,
         'delaware': 41,
         'florida': 79,
         'hawaii': 30,
         'idaho': 24,
         'illinois': 118,
         'indiana': 100,
         'iowa': 100,
         'kansas': 146,
         'maine': 34,
         'massachusetts': 160,
         'michigan': 110,
         'minnesota': 133,
         'mississippi': 2,
         'missouri': 163,
         'montana': 108,
         'nevada': 42,
         'new jersey': 2,
         'new mexico': 70,
         'new york': 149,
         'north carolina': 120,
         'ohio': 99,
         'oklahoma': 73,
         'oregon': 60,
         'pennsylvania': 203,
         'rhode island': 75,
         'south carolina': 144,
         'south dakota': 4,
         'tennessee': 99,
         'texas': 69,
         'utah': 75,
         'virginia': 2,
         'washington': 49,
         'west virginia': 67,
         'wisconsin': 99,
         'wyoming': 60}

**Next Steps:**
* add office and year
* add results of New Hampshire, Vermont, Georgia and Kentucky
  * https://ballotpedia.org/Vermont_House_of_Representatives_elections,_2016
  * https://ballotpedia.org/New_Hampshire_House_of_Representatives_elections,_2016
  * https://ballotpedia.org/Georgia_House_of_Representatives_elections,_2016
  * https://ballotpedia.org/Kentucky_House_of_Representatives_elections,_2016
* Look for data to predict the target on: https://data.census.gov/cedsci/advanced

In [21]:
state_house.head()

Unnamed: 0,year,stage,special,state,state_postal,state_fips,state_icpsr,county_name,county_fips,county_ansi,...,jurisdiction,precinct,candidate,candidate_normalized,office,district,writein,party,mode,votes
0,2016,gen,False,alaska,ak,2,81,,,,...,district 1,01-446 aurora,[write-in],in,state house,1,True,independent,election day,74
1,2016,gen,False,alaska,ak,2,81,,,,...,district 1,01-446 aurora,"kawasaki, scott j.",kawasaki,state house,1,False,democratic,election day,633
2,2016,gen,False,alaska,ak,2,81,,,,...,district 1,01-455 fairbanks no. 1,[write-in],in,state house,1,True,independent,election day,17
3,2016,gen,False,alaska,ak,2,81,,,,...,district 1,01-455 fairbanks no. 1,"kawasaki, scott j.",kawasaki,state house,1,False,democratic,election day,142
4,2016,gen,False,alaska,ak,2,81,,,,...,district 1,01-465 fairbanks no. 2,[write-in],in,state house,1,True,independent,election day,25


In [60]:
def create_dataframe(state_list, district_list, target_list):
  df = pd.DataFrame({
    "state": state_list,
    "district": district_list,
    "office": ["state house"] * len(state_list),
    "year": [2016] * len(state_list),
    "target": target_list})
  return df

In [61]:
df = create_dataframe(state_list, district_list, target_list)
df.head()

Unnamed: 0,state,district,office,year,target
0,alaska,1,state house,2016,democratic
1,alaska,2,state house,2016,republican
2,alaska,3,state house,2016,republican
3,alaska,4,state house,2016,democratic
4,alaska,5,state house,2016,democratic


### Adding New Hampshire, Vermont, Georgia, Kentucky and Arkansas to the DataFrame

In [62]:
df.target.value_counts()

republican     1799
democratic     1444
independent       2
Name: target, dtype: int64

In [63]:
# New Hampshire
district_nh = ["belknap 1", "belknap 7", "carroll 1", "cheshire 2", "cheshire 3", "cheshire 4", 
               "cheshire 5", "cheshire 6", "cheshire 7", "cheshire 8", "cheshire 10", "cheshire 13", 
               "coos 2", "coos 4", "coos 5", "coos 6", "grafton 2", "grafton 3", "grafton 4", 
               "grafton 5", "grafton 6", "grafton 7", "grafton 10", "grafton 11", "hillsborough 3", 
               "merrimack 1", "merrimack 4", "merrimack 7", "merrimack 8", "merrimack 11", "merrimack 12", 
               "merrimack 13", "merrimack 14", "merrimack 15", "merrimack 16", "merrimack 17", 
               "merrimack 18", "merrimack 19", "merrimack 22", "rockingham 1", "rockingham 10", 
               "rockingham 11", "rockingham 12", "rockingham 15", "rockingham 16", "rockingham 22", 
               "rockingham 23", "rockingham 25", "rockingham 26", "rockingham 27", "rockingham 28", 
               "rockingham 29", "strafford 5", "strafford 7", "strafford 8", "strafford 9", "strafford 10", 
               "strafford 11", "strafford 12", "strafford 13", "strafford 14", "strafford 15", 
               "strafford 16", "sullivan 2", "sullivan 3", "sullivan 4", "sullivan 5", "sullivan 7", "sullivan 8"]
state_nh = ["new hampshire"] * len(district_nh)
# republicans: 0, democrats: 1, independents: 2
target_nh = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1,
             1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1,
             1, 1, 0, 0, 1, 0, 0]
nh = create_dataframe(state_nh, district_nh, target_nh)
nh.head()

Unnamed: 0,state,district,office,year,target
0,new hampshire,belknap 1,state house,2016,0
1,new hampshire,belknap 7,state house,2016,0
2,new hampshire,carroll 1,state house,2016,0
3,new hampshire,cheshire 2,state house,2016,1
4,new hampshire,cheshire 3,state house,2016,1


In [64]:
# Vermont
district_vermont = ["addison-2", "addison-5", "addison-rutland", "bennington-1", "bennington-3", "bennington-rutland",
                    "caledonia-1", "caledonia-2", "caledonia-washington", "chittenden-1", "chittenden-4-1",
                    "chittenden-4-2", "chittenden-5-1", "chittenden-5-2", "chittenden-6-2", "chittenden-6-6", 
                    "chittenden-7-1", "chittenden-7-2", "chittenden-7-3", "chittenden-7-4", "chittenden-8-3",
                    "essex-caledonia", "essex-caledonia-orleans", "franklin-1", "franklin-2", "franklin-3-2", 
                    "franklin-6", "franklin-7", "lamoille-1", "lamoille-3", "orange-2", "orange-caledonia", 
                    "orleans-lamoille", "rutland-1", "rutland-4", "rutland-5-1", "rutland-5-2", "rutland-5-3",
                    "rutland-5-4", "rutland-bennington", "rutland-windsor-1", "rutland-windsor-2", "washington-5", 
                    "washington-6", "windham-1", "windham-2-1", "windham-2-2", "windham-2-3", "windham-5", "windham-6",
                    "windham-bennington", "windham-bennington-windsor", "windsor-2", "windsor-3-1", "windsor-4-1", 
                    "windsor-5", "windsor-orange-1", "windsor-rutland"]
state_vermont = ["vermont"] * len(district_vermont)
target_vermont = [1, 0, 2, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 2, 0, 1, 2, 0, 0, 1, 1, 0, 0, 
                  0, 0, 0, 1, 0, 2, 0, 0, 1, 1, 0, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 0, 2]

vermont = create_dataframe(state_vermont, district_vermont, target_vermont)
vermont.head()

Unnamed: 0,state,district,office,year,target
0,vermont,addison-2,state house,2016,1
1,vermont,addison-5,state house,2016,0
2,vermont,addison-rutland,state house,2016,2
3,vermont,bennington-1,state house,2016,1
4,vermont,bennington-3,state house,2016,1


In [65]:
# Georgia
state_georgia = ["georgia"] * 180
district_georgia = np.arange(1, 181)
# republicans: 0, democrats: 1, independents: 2
target_georgia = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                  0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 
                  1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 
                  1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 
                  0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 
                  0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 
                  0, 0, 1, 0, 0, 0]

georgia = create_dataframe(state_georgia, district_georgia, target_georgia)
georgia.head()

Unnamed: 0,state,district,office,year,target
0,georgia,1,state house,2016,0
1,georgia,2,state house,2016,0
2,georgia,3,state house,2016,0
3,georgia,4,state house,2016,0
4,georgia,5,state house,2016,0


In [66]:
# Kentucky
state_kentucky = ["kentucky"] * 100
district_kentucky = np.arange(1, 101)
# republicans: 0, democrats: 1, independents: 2
target_kentucky = [0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 
                   1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 
                   0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 
                   0, 0, 1, 1, 0, 0, 0, 0, 1, 1]

kentucky = create_dataframe(state_kentucky, district_kentucky, target_kentucky)
kentucky.head()

Unnamed: 0,state,district,office,year,target
0,kentucky,1,state house,2016,0
1,kentucky,2,state house,2016,0
2,kentucky,3,state house,2016,1
3,kentucky,4,state house,2016,0
4,kentucky,5,state house,2016,0


In [67]:
# Arkansas
state_arkansas = ["arkansas"] * 100
district_arkansas = np.arange(1, 101)
# republicans: 0, democrats: 1, independents: 2
target_arkansas = [0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 
                   0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 
                   1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 
                   0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

arkansas = create_dataframe(state_arkansas, district_arkansas, target_arkansas)
arkansas.head()

Unnamed: 0,state,district,office,year,target
0,arkansas,1,state house,2016,0
1,arkansas,2,state house,2016,0
2,arkansas,3,state house,2016,0
3,arkansas,4,state house,2016,0
4,arkansas,5,state house,2016,1


In [68]:
# Add these five states to df
df = pd.concat([df, nh, vermont, georgia, kentucky, arkansas])
df.tail()

Unnamed: 0,state,district,office,year,target
95,arkansas,96,state house,2016,0
96,arkansas,97,state house,2016,0
97,arkansas,98,state house,2016,0
98,arkansas,99,state house,2016,0
99,arkansas,100,state house,2016,0


In [69]:
df.shape

(3752, 5)

In [70]:
### Change target variable to numeric
target_values = {"republican": 0, "democratic": 1, "independent": 2}
df.replace({"target": target_values}, inplace=True)
df.head()

Unnamed: 0,state,district,office,year,target
0,alaska,1,state house,2016,1
1,alaska,2,state house,2016,0
2,alaska,3,state house,2016,0
3,alaska,4,state house,2016,1
4,alaska,5,state house,2016,1
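Since the target column mixes string labels (from the computed winners) with hand-coded integers, it can be useful to check that no label is left unmapped. The following sketch uses `Series.map` on a toy series; the label `"green"` is a made-up example of a value missing from `target_values`.

```python
import pandas as pd

target_values = {"republican": 0, "democratic": 1, "independent": 2}

# Toy labels; "green" is deliberately absent from the mapping.
labels = pd.Series(["democratic", "republican", "independent", "green"])

# map() returns NaN for every label that has no entry in the dictionary
encoded = labels.map(target_values)
unmapped = labels[encoded.isna()].unique().tolist()
print(unmapped)
```

Unlike `replace`, which leaves unknown values untouched, `map` makes them visible as missing values.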


In [71]:
# Sort dataframe by state names and save it to csv
# df.sort_values(["state", "district"], inplace=True)
df.to_csv("drive/MyDrive/US Elections/target_ready_2016.csv", index=False)

## Adding new data from census.gov

In [36]:
import pandas as pd
import numpy as np

df = pd.read_csv("/content/drive/MyDrive/US Elections/target_ready_2016.csv")

In [37]:
df.tail()

Unnamed: 0,state,district,office,year,target
3783,arkansas,96,state house,2016,0
3784,arkansas,97,state house,2016,0
3785,arkansas,98,state house,2016,0
3786,arkansas,99,state house,2016,0
3787,arkansas,100,state house,2016,0


In [39]:
df[df["state"] == "new hampshire"]

Unnamed: 0,state,district,office,year,target
3245,new hampshire,belknap 1,state house,2016,0
3246,new hampshire,belknap 7,state house,2016,0
3247,new hampshire,belknap 8,state house,2016,0
3248,new hampshire,belknap 9,state house,2016,0
3249,new hampshire,carroll 1,state house,2016,0
...,...,...,...,...,...
3345,new hampshire,sullivan 7,state house,2016,0
3346,new hampshire,sullivan 8,state house,2016,0
3347,new hampshire,sullivan 9,state house,2016,1
3348,new hampshire,sullivan 10,state house,2016,1


In [None]:
df["state"].unique()

In [None]:
for state in df["state"].unique():
  print(state, df[df["state"] == state]["district"].nunique())

* Arkansas: 012, 045, 046 
* Delaware: additional district called "State House District not defined, Delaware"
* Florida: 013, 014, 020, 043, 035, 036, 061, 063, 070, 094, 095, 096, 097, 098, 099, 100, 102, 107, 108, 109, 113, 117
* Hawaii: 006, 007, 008, 012, 021, 023, 025, 026, 027, 028, 029, 031, 032, 038, 042, 046, 048
* Massachusetts: possibly all State House Districts are surplus + "Barnstable, Dukes & Nantucket District, Massachusetts"
* Minnesota: completely different names -> have to compare them graphically
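The district codes noted above are zero-padded to three digits, while `df["district"]` holds values such as `21` or `5`. Assuming the census tables use the padded form, a join key could be built as sketched below. The column name `district_key` and the toy rows are hypothetical; named districts such as those in New Hampshire are left unchanged.

```python
import pandas as pd

# Toy rows (illustrative; same column names as df).
toy = pd.DataFrame({
    "state": ["florida", "florida", "new hampshire"],
    "district": ["21", "5", "belknap 1"],
})

# True for purely numeric district identifiers
is_numeric = toy["district"].str.isdigit()

# Pad numeric identifiers to three digits; keep all others as they are
toy["district_key"] = toy["district"].where(~is_numeric, toy["district"].str.zfill(3))
print(toy)
```

If `district` was read from CSV as integers, it would first need to be converted with `astype(str)`.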

In [72]:
df[df["state"] == "florida"]

Unnamed: 0,state,district,office,year,target
377,florida,21,state house,2016,0
378,florida,10,state house,2016,0
379,florida,5,state house,2016,0
380,florida,6,state house,2016,0
381,florida,19,state house,2016,0
...,...,...,...,...,...
451,florida,28,state house,2016,0
452,florida,29,state house,2016,0
453,florida,26,state house,2016,1
454,florida,27,state house,2016,0


In [None]:
df[df[""]]