# Using free and open source tools to analyze data from the Federal Trade Commission (FTC)

This notebook shows how to use free tools -- Python, which is open source, and Tableau Public, which is free to use -- to analyze Do Not Call program data made available to the public by the Federal Trade Commission. This notebook is not affiliated with the FTC.

The data used in the analysis below was taken from https://www.ftc.gov/site-information/open-government/data-sets/do-not-call-data. It includes Do Not Call and robocall reports made to the Federal Trade Commission. The data contains information reported by consumers, including the telephone number that originated the unwanted call, the date the complaint was created, the time the call was made, the consumer’s city and state, the subject of the call, the consumer’s area code, and whether the call was a robocall.


We will use Python to automatically pull the data from the web, clean it, and create a data set that can be used to build interactive dashboards with Tableau.

The dashboards are made available on Tableau Public – a free service that allows users to publish dashboards to the 

The dashboard below is located at: https://public.tableau.com/profile/paul.witt2290#!/
 

<div class='tableauPlaceholder' id='viz1540684925001' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Do&#47;DoNotCallPublicDataSets&#47;ofCallsPerConsumerCity&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='DoNotCallPublicDataSets&#47;ofCallsPerConsumerCity' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Do&#47;DoNotCallPublicDataSets&#47;ofCallsPerConsumerCity&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1540684925001');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='1000px';vizElement.style.height='827px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

## Section I: Data Wrangling with Python

We start by using the requests Python library to access the HTML code from FTC.gov.  http://docs.python-requests.org/en/master/

The requests library has a straightforward API that allows us to easily request data from FTC.gov.


Below we create a response object to retrieve the web page that contains the data we need. The response object contains a server’s response to an HTTP request. The .get method below initiates an HTTP GET request.

For more on HTTP requests see https://www.w3schools.com/tags/ref_httpmethods.asp


In [2]:
import requests  

r = requests.get('https://www.ftc.gov/site-information/open-government/data-sets/do-not-call-data') 
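
The call above does not check whether the request succeeded. As a minimal sketch, not part of the original analysis, the helper below adds a timeout and raises an error when the server answers with a failure status. The name fetch_page and the 30-second timeout are illustrative choices. The last lines build the same GET request without sending it, only to show what it consists of.

```python
import requests

# Same URL as in the cell above.
FTC_URL = 'https://www.ftc.gov/site-information/open-government/data-sets/do-not-call-data'


def fetch_page(url, timeout=30):
    """Hypothetical helper: return the page HTML or raise on failure."""
    response = requests.get(url, timeout=timeout)  # HTTP GET with a time limit
    response.raise_for_status()  # raises requests.HTTPError on a 4xx/5xx status
    return response.text


# Build the GET request without sending it, just to inspect its parts.
prepared = requests.Request('GET', FTC_URL).prepare()
print(prepared.method, prepared.url)
```

Calling fetch_page(FTC_URL) would return the same text as r.text, but only when the server reports success.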

We now have a response object whose text attribute holds the raw HTML of the page, which lets us inspect the webpage that contains our data.

In [25]:
print(r.text[:1000]) 

<!DOCTYPE html>
<!--[if IEMobile 7]><html class="iem7"  lang="en" dir="ltr"><![endif]-->
<!--[if lte IE 6]><html class="lt-ie9 lt-ie8 lt-ie7"  lang="en" dir="ltr"><![endif]-->
<!--[if (IE 7)&(!IEMobile)]><html class="lt-ie9 lt-ie8"  lang="en" dir="ltr"><![endif]-->
<!--[if (IE 7)]><html class="lt-ie8"  lang="en" dir="ltr"><![endif]-->
<!--[if IE 8]><html class="lt-ie9"  lang="en" dir="ltr"><![endif]-->
<!--[if (gte IE 9)|(gt IEMobile 7)]><!--><html  lang="en" dir="ltr" prefix="og: http://ogp.me/ns# article: http://ogp.me/ns/article# book: http://ogp.me/ns/book# profile: http://ogp.me/ns/profile# video: http://ogp.me/ns/video# product: http://ogp.me/ns/product# content: http://purl.org/rss/1.0/modules/content/ dc: http://purl.org/dc/terms/ foaf: http://xmlns.com/foaf/0.1/ rdfs: http://www.w3.org/2000/01/rdf-schema# sioc: http://rdfs.org/sioc/ns# sioct: http://rdfs.org/sioc/types# skos: http://www.w3.org/2004/02/skos/core# xsd: http://www.w3.org/2001/XMLSchema#"><!--<![endif]-->

<head p

We only need to access the CSV files, so most of what we see here is not useful. We could use string operations to search for what we need, but that would be cumbersome and time consuming. Instead, we will use the Beautiful Soup Python library. The Beautiful Soup API will help us quickly parse the strings on this page to get at what we need.

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. For our purposes, the ability to quickly search and access the tags, attributes and elements in the webpage is what we need to retrieve our data.

For more on HTML objects see https://www.456bereastreet.com/archive/200508/html_tags_vs_elements_vs_attributes/
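
To make the terms concrete, here is a small sketch on a made-up HTML fragment (it is not taken from FTC.gov, and the names html, soup_demo and link are illustrative). It shows how Beautiful Soup exposes an element's tag name, one of its attributes, and its text.

```python
from bs4 import BeautifulSoup

# A made-up fragment used only for illustration.
html = '<p>Weekly file: <a href="/files/sample.csv">sample.csv</a></p>'
soup_demo = BeautifulSoup(html, 'html.parser')

link = soup_demo.find('a')   # the first <a> element in the fragment
print(link.name)             # the tag name
print(link['href'])          # the value of the href attribute
print(link.get_text())       # the text between the opening and closing tags
```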

We start by importing the Beautiful Soup library. We then pass the text of our response object to Beautiful Soup, together with the name of Python's built-in HTML parser.


In [4]:
from bs4 import BeautifulSoup  
soup = BeautifulSoup(r.text, 'html.parser') 

In [32]:
soup.head()

[<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':\n  new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],\n  j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=\n  'https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);\n  })(window,document,'script','dataLayer','GTM-KFKRFZQ');</script>,
 <meta charset="unicode-escape"/>,
 <link href="/node/1395982" rel="shortlink"/>,
 <link href="https://www.ftc.gov/sites/default/files/favicon_4.ico" rel="shortcut icon" type="image/vnd.microsoft.icon"/>,
 <link href="/site-information/open-government/data-sets/do-not-call-data" rel="canonical"/>,
 <meta content="Drupal 7 (http://drupal.org)" name="Generator"/>,
 <meta content="The official website of the Federal Trade Commission, protecting America\u2019s consumers for over 100 years." name="description"/>,
 <meta content="Drupal 7 (http://drupal.org)" name="generator"/>,
 <link href="https://www.ftc.gov/sites/default/

We now have a parsed object that we can apply simple and useful Beautiful Soup methods to.

In [6]:
import pandas as pd

def get_links():
    # Collect every <a> tag that has an href attribute, then keep only
    # the hrefs that point to the Do Not Call complaint CSV files.
    prefix = "https://www.ftc.gov/system/files/attachments/do-not-call-dnc-reported-calls-data/dnc_complaint_numbers_"
    links = soup.find_all('a', href=True)
    return [link['href'] for link in links if link['href'].startswith(prefix)]

get_links()


[u'https://www.ftc.gov/system/files/attachments/do-not-call-dnc-reported-calls-data/dnc_complaint_numbers_2018-10-26.csv',
 u'https://www.ftc.gov/system/files/attachments/do-not-call-dnc-reported-calls-data/dnc_complaint_numbers_2018-10-19.csv',
 u'https://www.ftc.gov/system/files/attachments/do-not-call-dnc-reported-calls-data/dnc_complaint_numbers_2018-10-12.csv',
 u'https://www.ftc.gov/system/files/attachments/do-not-call-dnc-reported-calls-data/dnc_complaint_numbers_2018-10-25.csv',
 u'https://www.ftc.gov/system/files/attachments/do-not-call-dnc-reported-calls-data/dnc_complaint_numbers_2018-10-18.csv',
 u'https://www.ftc.gov/system/files/attachments/do-not-call-dnc-reported-calls-data/dnc_complaint_numbers_2018-10-11.csv',
 u'https://www.ftc.gov/system/files/attachments/do-not-call-dnc-reported-calls-data/dnc_complaint_numbers_2018-10-24.csv',
 u'https://www.ftc.gov/system/files/attachments/do-not-call-dnc-reported-calls-data/dnc_complaint_numbers_2018-10-17.csv',
 u'https://www.f
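
The filtering step can be tried on a tiny page. The sketch below uses a made-up fragment with one matching link, one unrelated link, and one anchor that has no href at all. The fragment, the about-ftc URL and the names demo, anchors and csv_links are assumptions for illustration; only the prefix comes from the function above.

```python
from bs4 import BeautifulSoup

PREFIX = ('https://www.ftc.gov/system/files/attachments/'
          'do-not-call-dnc-reported-calls-data/dnc_complaint_numbers_')

# Made-up page fragment used only for illustration.
html = (
    '<a href="' + PREFIX + '2018-10-26.csv">data</a>'
    '<a href="https://www.ftc.gov/about-ftc">about</a>'
    '<a name="top">no href here</a>'
)
demo = BeautifulSoup(html, 'html.parser')

# href=True keeps only the <a> tags that actually carry an href attribute.
anchors = demo.find_all('a', href=True)

# Keep the hrefs that start with the CSV prefix.
csv_links = [a['href'] for a in anchors if a['href'].startswith(PREFIX)]
print(len(anchors), csv_links)
```

Without href=True, the third anchor would be returned as well, and indexing it with ['href'] would raise a KeyError.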


In [8]:
# Read each CSV into its own DataFrame, skipping rows with the wrong number of fields.
dfs = [pd.read_csv(link, error_bad_lines=False) for link in get_links()]

Skipping line 11650: expected 8 fields, saw 15
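
The message above means one row had more fields than the header and was dropped. Note that error_bad_lines was deprecated in pandas 1.3 and removed in pandas 2.0; on those versions the on_bad_lines parameter plays the same role. The sketch below shows the newer spelling on a made-up in-memory CSV (the column names and values are illustrative and it assumes pandas 1.3 or later).

```python
import io

import pandas as pd

# Made-up CSV text: the second data row has 5 fields instead of 3.
raw = 'a,b,c\n1,2,3\n4,5,6,7,8\n9,10,11\n'

# on_bad_lines='skip' drops malformed rows silently;
# on_bad_lines='warn' drops them and reports each one.
clean = pd.read_csv(io.StringIO(raw), on_bad_lines='skip')
print(len(clean))
```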



In [9]:
df = pd.concat(dfs, ignore_index=True)
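
pd.concat stacks the individual tables on top of each other. The toy frames below are made up for illustration and show what ignore_index=True changes: without it each frame keeps its own row labels, so labels repeat in the combined table.

```python
import pandas as pd

# Two made-up tables standing in for two downloaded files.
first = pd.DataFrame({'Subject': ['a', 'b']})
second = pd.DataFrame({'Subject': ['c']})

stacked = pd.concat([first, second])                      # keeps the original labels
combined = pd.concat([first, second], ignore_index=True)  # renumbers the rows from 0

print(stacked.index.tolist())
print(combined.index.tolist())
```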

In [10]:
df.count()

Company_Phone_Number            695786
Created_Date                    717774
Violation_Date                  717773
Consumer_City                   452521
Consumer_State                  717183
Consumer_Area_Code              717763
Subject                         717773
Recorded_Message_Or_Robocall    709752
dtype: int64

In [11]:
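# Drop rows whose Created_Date is the literal string 'N', which cannot be parsed as a date.
df = df[df.Created_Date != 'N'].copy()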

In [12]:
df.Created_Date=pd.to_datetime(df.Created_Date)
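
The two cells above remove one known bad value and then parse the column. An alternative, sketched here on made-up sample values rather than the FTC files, is to let pd.to_datetime mark anything it cannot parse as missing and drop those rows afterwards. This catches any unparseable value, not only 'N', which may or may not be what you want.

```python
import pandas as pd

# Made-up sample; the real column comes from the downloaded files.
sample = pd.DataFrame(
    {'Created_Date': ['2018-10-25 23:59:45', 'N', '2018-09-14 00:00:07']}
)

# errors='coerce' turns unparseable values into NaT instead of raising.
sample['Created_Date'] = pd.to_datetime(sample['Created_Date'], errors='coerce')

# Remove the rows where parsing failed.
sample = sample.dropna(subset=['Created_Date'])
print(len(sample), sample['Created_Date'].min())
```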

In [13]:
df.Created_Date.dt.day.unique()

array([25, 18, 11, 24, 17, 10, 23, 16,  9, 22, 15,  8, 19, 20, 21, 12, 13,
        5,  6,  7,  4, 27,  3, 26,  2,  1, 28, 29, 30, 14])

In [14]:
df.Created_Date.max()

Timestamp('2018-10-25 23:59:45')

In [15]:
df.Created_Date.min()

Timestamp('2018-09-14 00:00:07')

In [16]:
df.Company_Phone_Number=df.Company_Phone_Number.astype(str)
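
One thing to watch with astype(str): if pandas parsed the phone numbers as floating point, which can happen when a numeric column has missing values, the strings end in '.0' and missing values become the text 'nan'. Whether that applies here depends on the actual files, so treat the sketch below as an illustration on made-up data. It compares the conversion above with reading the column as text from the start.

```python
import io

import pandas as pd

# Made-up CSV text with one phone number and one missing value.
raw = 'Company_Phone_Number,Subject\n2025550123,Other\n,Other\n'

# Default parsing: the missing value forces a float column.
as_float = pd.read_csv(io.StringIO(raw))
converted = as_float['Company_Phone_Number'].astype(str)
print(converted.tolist())

# Reading the column as text keeps the digits intact and the gap missing.
as_text = pd.read_csv(io.StringIO(raw), dtype={'Company_Phone_Number': str})
print(as_text['Company_Phone_Number'].tolist())
```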

We now have a clean data set that is ready for Tableau.

In [19]:
df.to_csv('dnc_pull.csv')
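
By default to_csv also writes the row index as an unnamed first column, which then shows up as an extra field in Tableau. If that column is not wanted, index=False leaves it out. The sketch below writes a made-up one-row table to memory both ways so the header lines can be compared.

```python
import io

import pandas as pd

# Made-up one-row table used only for illustration.
sample = pd.DataFrame({'Subject': ['Other'], 'Consumer_State': ['Virginia']})

with_index = io.StringIO()
sample.to_csv(with_index)                  # default: index written as first column

without_index = io.StringIO()
sample.to_csv(without_index, index=False)  # index left out

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(header_with)
print(header_without)
```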