**US House of Representatives Analysis**

*Objectives:*
1) Load Collected Data from API for Analysis ("https://house-stock-watcher-data.s3-us-west-2.amazonaws.com/data/all_transactions.json")
2) Parse and Clean
3) Better understand and brainstorm analysis ideas.
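
Objective 1 names a JSON endpoint, while the cells below read a local CSV export. As a minimal sketch, assuming the endpoint serves a JSON array of transaction records, the same data could be loaded directly with `pd.read_json`. The helper name `load_transactions` and the two sample records are illustrative and not part of the original notebook.

```python
import io
import pandas as pd

# Assumption: this URL returns a JSON array of transaction records.
URL = "https://house-stock-watcher-data.s3-us-west-2.amazonaws.com/data/all_transactions.json"

def load_transactions(source):
    """Read a JSON array of records from a URL, path, or file-like object."""
    return pd.read_json(source)

# Offline demo with two made-up records, so no network access is needed.
sample = io.StringIO(
    '[{"representative": "Hon. Virginia Foxx", "ticker": "BP", "type": "purchase"},'
    ' {"representative": "Hon. Virginia Foxx", "ticker": "XOM", "type": "purchase"}]'
)
demo = load_transactions(sample)
print(demo.shape)

# With network access, the full dataset would be loaded like this:
# data = load_transactions(URL)
```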

In [152]:
#load dependencies
import pandas as pd

*Data Collection and Exploration*

In [153]:
#load dataset
data = pd.read_csv("USA_Congress_Transactions_data.csv")
print(data.head(3))

              amount                                  asset_description  \
0   $1,001 - $15,000                                             BP plc   
1   $1,001 - $15,000                            Exxon Mobil Corporation   
2  $15,001 - $50,000  Industrial Logistics Properties Trust - Common...   

   cap_gains_over_200_usd disclosure_date  disclosure_year district  owner  \
0                   False      10/04/2021             2021     NC05  joint   
1                   False      10/04/2021             2021     NC05  joint   
2                   False      10/04/2021             2021     NC05  joint   

                                            ptr_link      representative  \
0  https://disclosures-clerk.house.gov/public_dis...  Hon. Virginia Foxx   
1  https://disclosures-clerk.house.gov/public_dis...  Hon. Virginia Foxx   
2  https://disclosures-clerk.house.gov/public_dis...  Hon. Virginia Foxx   

  ticker transaction_date      type  
0     BP       2021-09-27  purchase  
1    

In [154]:
print(data.columns)
print(data.shape)

Index(['amount', 'asset_description', 'cap_gains_over_200_usd',
       'disclosure_date', 'disclosure_year', 'district', 'owner', 'ptr_link',
       'representative', 'ticker', 'transaction_date', 'type'],
      dtype='object')
(15433, 12)


*Data Transformation and Validation*

In [155]:
#Inspect the unique values in the amount column
print(data.amount.unique())

['$1,001 - $15,000' '$15,001 - $50,000' '$50,001 - $100,000'
 '$100,001 - $250,000' '$1,001 -' '$250,001 - $500,000'
 '$500,001 - $1,000,000' '$5,000,001 - $25,000,000'
 '$1,000,001 - $5,000,000' '$1,000,000 +' '$1,000 - $15,000'
 '$15,000 - $50,000' '$50,000,000 +' '$1,000,000 - $5,000,000']


Some values in the amount column differ only by one dollar in the lower bound:
'$1,000 - $15,000' and '$1,001 - $15,000' describe the same bracket.
Let's normalize these across the entire column.

In [156]:
#Fix amount column
#Replace a lower bound ending in '01' with '00' (e.g. '$1,001' -> '$1,000').
#A vectorized string replace avoids the chained-assignment SettingWithCopyWarning
#that element-by-element assignment through data['amount'][i] triggers.
data['amount'] = data['amount'].str.replace(r'01\b', '00', regex=True)


In [157]:
print(data.amount.unique())
print(data.shape)

['$1,000 - $15,000' '$15,000 - $50,000' '$50,000 - $100,000'
 '$100,000 - $250,000' '$1,000 -' '$250,000 - $500,000'
 '$500,000 - $1,000,000' '$5,000,000 - $25,000,000'
 '$1,000,000 - $5,000,000' '$1,000,000 +' '$50,000,000 +']
(15433, 12)
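
One way to sanity-check the one-dollar normalization is to run it on a few of the labels printed earlier, outside the DataFrame. The sketch below is a standalone illustration: `normalize_amount` and `lower_bound` are hypothetical helper names, and the numeric lower bound is only a suggestion for later sorting or binning.

```python
import re

def normalize_amount(label):
    """Turn a lower bound ending in '01' into one ending in '00'."""
    return re.sub(r"01\b", "00", label)

def lower_bound(label):
    """Return the first dollar figure in a bracket label as an int, or None."""
    match = re.search(r"\$([\d,]+)", label)
    return int(match.group(1).replace(",", "")) if match else None

labels = ["$1,001 - $15,000", "$1,000 - $15,000", "$1,001 -", "$1,000,000 +"]
# Equivalent labels collapse to a single value after normalization.
print(sorted({normalize_amount(label) for label in labels}))
print(lower_bound(normalize_amount("$15,001 - $50,000")))
```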


In [158]:
#Modify representative format for easier merging with other datasets
#Remove the "Hon." prefix before each representative's name
#(drop the first space-separated token and keep the rest)
data['representative'] = data['representative'].str.split(" ").str[1:].str.join(" ")
data['representative'].unique()


array(['Virginia Foxx', 'Alan S. Lowenthal', 'Aston Donald McEachin',
       'Austin Scott', 'Thomas Suozzi', 'Christopher L. Jacobs',
       'Susie Lee', 'TJ John (Tj) Cox', 'Mo Brooks', 'Robert J. Wittman',
       'Vern Buchanan', 'Lois Frankel', 'Michael T. McCaul',
       'Suzan K. DelBene', 'Greg Gianforte', 'Lloyd K. Smucker',
       'Earl Blumenauer', 'James Comer', 'James R. Langevin',
       'John Curtis', 'Trey Hollingsworth', 'Anthony E. Gonzalez',
       'William R. Keating', 'Raúl M. Grijalva', 'Josh Gottheimer',
       'Katherine M. Clark', 'Carolyn B. Maloney', 'Pete Sessions',
       'David B. McKinley', 'Nancy Pelosi', 'Steve Cohen',
       'Gerald E. Connolly', 'Lloyd Doggett', 'David E. Price',
       'Kathy Manning', 'Scott H. Peters', 'Sean Patrick Maloney',
       'Michael K. Simpson', 'Greg Steube',
       'Donald Sternoff Honorable Beyer', 'Mark Dr Green', 'Brian Mast',
       'Mary Gay Scanlon', 'Charles J. "Chuck" Fleischmann', 'Mike Kelly',
       'Marie Newm
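
Dropping the first token removes "Hon.", but names printed in this notebook such as 'Donald Sternoff Honorable Beyer' and 'Mark Dr Green' still carry embedded titles, which would likely prevent an exact-name match against another dataset. A minimal sketch of one possible cleanup follows; the `TITLES` set and the `strip_titles` helper are assumptions built from the names visible here, not an exhaustive or authoritative list.

```python
# Hypothetical set of honorifics and credentials, compared in lowercase
# after stripping trailing periods and commas.
TITLES = {"hon", "honorable", "dr", "md", "facs"}

def strip_titles(name):
    """Drop any whitespace-separated token that is a known title."""
    kept = [tok for tok in name.split() if tok.strip(".,").lower() not in TITLES]
    return " ".join(kept)

for raw in ["Hon. Virginia Foxx", "Donald Sternoff Honorable Beyer", "Mark Dr Green"]:
    print(raw, "->", strip_titles(raw))
```

A token-level filter like this would also remove a genuine name part that happens to equal one of the titles, so the output is worth reviewing by eye.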

In [159]:
#Delete rows with incomplete data 
data = data.dropna()
print(data.shape)

(9521, 12)
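
The row count falls from 15433 to 9521 here, so it may be worth checking which columns hold the missing values before dropping everything. The sketch below uses a made-up three-row frame; `missing_report` is a hypothetical helper, and restricting `dropna` to a `subset` is one option rather than what the notebook does.

```python
import pandas as pd

def missing_report(df):
    """Count missing values per column, largest first."""
    return df.isna().sum().sort_values(ascending=False)

# Toy data: only the column names mirror the real dataset.
toy = pd.DataFrame({
    "ticker": ["BP", None, "XOM"],
    "owner": ["joint", None, None],
    "type": ["purchase", "sale_full", "purchase"],
})
report = missing_report(toy)
print(report)

# Drop rows only when a column the analysis depends on is missing.
trimmed = toy.dropna(subset=["ticker"])
print(trimmed.shape)
```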


In [160]:
#Inspect district, representative, and type
data.district.unique()

array(['NC05', 'CA47', 'VA04', 'NV03', 'AL05', 'TX10', 'WA01', 'MT00',
       'OR03', 'KY01', 'FL21', 'RI02', 'UT03', 'AZ03', 'NJ05', 'MA05',
       'WV01', 'CA12', 'TN09', 'VA11', 'NC04', 'NC06', 'CA52', 'FL17',
       'VA08', 'PA05', 'TN03', 'PA16', 'IL03', 'TN07', 'VA01', 'MS03',
       'TX17', 'CA17', 'TX35', 'NY06', 'GA12', 'MI12', 'OK04', 'FL02',
       'CO05', 'IN01', 'FL16', 'GA14', 'CA06', 'MN03', 'VT00', 'NY08',
       'CO07', 'CA19', 'TX11', 'FL18', 'IA03', 'OK01', 'IN03', 'MO02',
       'MD08', 'FL15', 'IL10', 'WA08', 'MI06', 'OH16', 'CA39', 'CA53',
       'KY03', 'TX03', 'FL23', 'CT02', 'PA03', 'SC07', 'OH07', 'TX26',
       'KS01', 'ME01', 'NC02', 'FL12', 'NJ11', 'KS04', 'WI08', 'NJ06',
       'NY27', 'MA06', 'FL14', 'IA01', 'ID02', 'TN01', 'OH05', 'MA03',
       'AZ01', 'GA01', 'NC07', 'IL16', 'FL27', 'SC04', 'IL17', 'TX24',
       'WV03', 'NY12', 'GA08', 'PA09', 'MD06', 'FL04', 'AR02', 'MA04',
       'CA03', 'IN05', 'TN08', 'VA07', 'CA38', 'NY25', 'CA28', 'CA48',
      

In [161]:
data.representative.unique()

array(['Virginia Foxx', 'Alan S. Lowenthal', 'Aston Donald McEachin',
       'Susie Lee', 'Mo Brooks', 'Michael T. McCaul', 'Suzan K. DelBene',
       'Greg Gianforte', 'Earl Blumenauer', 'James Comer', 'Lois Frankel',
       'James R. Langevin', 'John Curtis', 'Raúl M. Grijalva',
       'Josh Gottheimer', 'Katherine M. Clark', 'David B. McKinley',
       'Nancy Pelosi', 'Steve Cohen', 'Gerald E. Connolly',
       'David E. Price', 'Kathy Manning', 'Scott H. Peters',
       'Greg Steube', 'Donald Sternoff Honorable Beyer',
       'Mary Gay Scanlon', 'Charles J. "Chuck" Fleischmann', 'Mike Kelly',
       'Marie Newman', 'Mark Dr Green', 'Robert J. Wittman',
       'Michael Patrick Guest', 'Pete Sessions', 'Rohit Khanna',
       'Lloyd Doggett', 'Grace Meng', 'Richard W. Allen',
       'Debbie Dingell', 'Tom Cole', 'Neal Patrick MD, Facs Dunn',
       'Doug Lamborn', 'Peter J. Visclosky', 'Vern Buchanan',
       'Marjorie Taylor Greene', 'Doris O. Matsui', 'Dean Phillips',
       'Peter 

In [162]:
data.type.unique()

array(['purchase', 'sale_partial', 'sale_full', 'exchange', 'sale'],
      dtype=object)

In [163]:
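#Check for duplicate rows
#Note: drop_duplicates() returns a new DataFrame and the result is not assigned here,
#so data (and the shape printed below) is unchanged by this call
data.drop_duplicates()
print(data.shape)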

(9521, 12)
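
Because `drop_duplicates()` returns a new DataFrame, an unchanged shape after an unassigned call does not show that duplicates are absent. A small sketch on made-up data, counting duplicates first and then assigning the result:

```python
import pandas as pd

# Toy frame with one exact duplicate row.
toy = pd.DataFrame({
    "ticker": ["BP", "BP", "XOM"],
    "type": ["purchase", "purchase", "sale_full"],
})

# duplicated() flags every repeat of an earlier row.
n_dupes = int(toy.duplicated().sum())
print(n_dupes)

# Assign the result; the original frame is left untouched.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(toy.shape, deduped.shape)
```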


*Analysis and Visualization*

In [164]:
data.to_csv("./Cleaned_House_Stock_US.csv",index=False)

Now that this dataset is cleaned, I will continue the analysis in Tableau Public.

I will search for a dataset listing each representative's political party so I can compare Republican and Democratic investments.

https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_House_of_Representatives

However, this list only contains current representatives; for the party data to be complete, I would have to limit the transactions dataset to the current two-year term, 2021-2022.

Before proceeding, I will clean this dataset as well.
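
If the join were done in pandas rather than Tableau, a left merge with `indicator=True` would show which traders have no match in the current-members list. This is only a sketch on toy rows: "Jane Example" is a made-up former member, the tickers are placeholders, and matching on exact names is an assumption that real data may not satisfy.

```python
import pandas as pd

# Toy transactions; column name matches the cleaned dataset.
trades = pd.DataFrame({
    "representative": ["Virginia Foxx", "Mo Brooks", "Jane Example"],
    "ticker": ["BP", "AAA", "BBB"],
})
# Toy members table using the Member and Party column names from the list.
members = pd.DataFrame({
    "Member": ["Virginia Foxx", "Mo Brooks"],
    "Party": ["Republican", "Republican"],
})

merged = trades.merge(
    members, how="left", left_on="representative", right_on="Member", indicator=True
)
# Rows marked left_only had no party information available.
unmatched = merged.loc[merged["_merge"] == "left_only", "representative"].unique()
print(list(unmatched))
```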


**List of Current USA House Representatives**

In [165]:
reps = pd.read_csv("List_of_Representatives.csv")
reps.head()


Unnamed: 0,District,Member,Party,Prior experience,Education,Assumed office,Residence,Born[2]
0,Alabama 1,Jerry Carl,Republican,Mobile County Commission,Florida Gateway College,2021,Mobile,"June 17, 1958(age 64)"
1,Alabama 2,Barry Moore,Republican,Alabama House of Representatives,Enterprise State Community College (AS)\r\nAub...,2021,Enterprise,"September 26, 1966(age 56)"
2,Alabama 3,Mike Rogers,Republican,Calhoun County Commissioner\r\nAlabama House o...,"Jacksonville State University (BA, MPA)\r\nBir...",2003,Anniston,"July 16, 1958(age 64)"
3,Alabama 4,Robert Aderholt,Republican,Haleyville Municipal Judge,University of North Alabama\r\nBirmingham–Sout...,1997,Haleyville,"July 22, 1965(age 57)"
4,Alabama 5,Mo Brooks,Republican,Alabama House of Representatives\r\nMadison Co...,Duke University (BA)\r\nUniversity of Alabama ...,2011,Huntsville,"April 29, 1954(age 68)"


We are only interested in each representative's political party, district, and age.

In [166]:
#Fix district column: keep only the state name (the text before the non-breaking space)
reps['District'] = reps['District'].str.split("\xa0").str[0]
reps.District.unique()

array(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
       'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia',
       'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas',
       'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts',
       'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana',
       'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico',
       'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma',
       'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina',
       'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont',
       'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming'],
      dtype=object)
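
Keeping only the state name discards the district number, while the transactions dataset uses codes such as 'NC05'. If a district-level match were wanted, one option is to build the same code from the original label. The sketch below assumes labels of the form state, non-breaking space, number, and assumes at-large seats carry a non-numeric suffix that maps to 00 (as in 'VT00'); `STATE_ABBREV` is a deliberately partial, illustrative mapping and `district_code` is a hypothetical helper.

```python
# Illustrative subset; a full mapping would need every state.
STATE_ABBREV = {"Alabama": "AL", "North Carolina": "NC", "Vermont": "VT"}

def district_code(label):
    """Convert e.g. 'Alabama\\xa05' into 'AL05'; non-numeric seats become 00."""
    state, _, number = label.partition("\xa0")
    seat = int(number) if number.isdigit() else 0
    return f"{STATE_ABBREV[state]}{seat:02d}"

print(district_code("North Carolina\xa05"))
print(district_code("Vermont\xa0at-large"))
```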

In [167]:
#rename Born[2] column 
reps.rename(columns = {'Born[2]':'Birth_date'}, inplace = True)
reps.head()

Unnamed: 0,District,Member,Party,Prior experience,Education,Assumed office,Residence,Birth_date
0,Alabama,Jerry Carl,Republican,Mobile County Commission,Florida Gateway College,2021,Mobile,"June 17, 1958(age 64)"
1,Alabama,Barry Moore,Republican,Alabama House of Representatives,Enterprise State Community College (AS)\r\nAub...,2021,Enterprise,"September 26, 1966(age 56)"
2,Alabama,Mike Rogers,Republican,Calhoun County Commissioner\r\nAlabama House o...,"Jacksonville State University (BA, MPA)\r\nBir...",2003,Anniston,"July 16, 1958(age 64)"
3,Alabama,Robert Aderholt,Republican,Haleyville Municipal Judge,University of North Alabama\r\nBirmingham–Sout...,1997,Haleyville,"July 22, 1965(age 57)"
4,Alabama,Mo Brooks,Republican,Alabama House of Representatives\r\nMadison Co...,Duke University (BA)\r\nUniversity of Alabama ...,2011,Huntsville,"April 29, 1954(age 68)"


In [168]:
#Replace Birth_date with the age parsed from strings like "June 17, 1958(age 64)"
#Drop rows whose House seat is currently VACANT
reps = reps[reps['Birth_date'] != 'VACANT'].copy()
#Take the text after "(", then the part after the non-breaking space, minus the trailing ")"
reps['Birth_date'] = reps['Birth_date'].str.split("(").str[1].str.split("\xa0").str[1].str[:-1]
reps.Birth_date.unique()

array(['64', '56', '57', '68', '49', '76', '72', '74', '63', '60', '42',
       '52', '65', '54', '62', '58', '77', '66', '71', '78', '36', '70',
       '82', '41', '46', '79', '45', '69', '59', '67', '43', '85', '53',
       '47', '50', '81', '61', '84', '48', '33', '38', '35', '40', '34',
       '44', '73', '55', '75', '39', '51', '83', '37', '32', '27', '86',
       '80'], dtype=object)
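
The ages above are still strings, and the birth date itself is discarded. As a hedged alternative, a regular expression can pull both the date and an integer age out of the raw value. `parse_birth` is a hypothetical helper, and the pattern assumes the format shown in the table (month name, day, year, then "(age", a non-breaking space, the number, and ")").

```python
import re
from datetime import datetime

# \s also matches the non-breaking space between "age" and the number.
BIRTH_RE = re.compile(r"^(?P<date>[A-Za-z]+ \d{1,2}, \d{4})\s*\(age\s(?P<age>\d+)\)$")

def parse_birth(text):
    """Return (birth_date, age) or None when the text does not match, e.g. 'VACANT'."""
    match = BIRTH_RE.match(text.strip())
    if match is None:
        return None
    born = datetime.strptime(match.group("date"), "%B %d, %Y").date()
    return born, int(match.group("age"))

print(parse_birth("June 17, 1958(age\xa064)"))
print(parse_birth("VACANT"))
```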

In [169]:
reps.head()

Unnamed: 0,District,Member,Party,Prior experience,Education,Assumed office,Residence,Birth_date
0,Alabama,Jerry Carl,Republican,Mobile County Commission,Florida Gateway College,2021,Mobile,64
1,Alabama,Barry Moore,Republican,Alabama House of Representatives,Enterprise State Community College (AS)\r\nAub...,2021,Enterprise,56
2,Alabama,Mike Rogers,Republican,Calhoun County Commissioner\r\nAlabama House o...,"Jacksonville State University (BA, MPA)\r\nBir...",2003,Anniston,64
3,Alabama,Robert Aderholt,Republican,Haleyville Municipal Judge,University of North Alabama\r\nBirmingham–Sout...,1997,Haleyville,57
4,Alabama,Mo Brooks,Republican,Alabama House of Representatives\r\nMadison Co...,Duke University (BA)\r\nUniversity of Alabama ...,2011,Huntsville,68


Now that this dataset is cleaned as well, I will continue the analysis in Tableau Public.