## Collecting Data from the Internet with Python


### Neal Caren  
Department of Sociology  
University of North Carolina,  Chapel Hill  
neal.caren@gmail.com  
@haphazardsoc  


#### https://github.com/nealcaren/ScrapingData

#### Why Python

![](https://raw.githubusercontent.com/nealcaren/UiOBigData/master/notebooks/images/stanford.png)

![](https://raw.githubusercontent.com/nealcaren/UiOBigData/master/notebooks/images/language_rank.png)

## The PandaKit


* [pandas](http://pandas.pydata.org) Data management and analysis.
* [scikit-learn](http://scikit-learn.org/stable/) Machine learning
* [jupyter notebook](http://jupyter.org) Development environment





<div class="alert alert-info">
<h3>Warm Up</h3>
    <ul>
<li> Store the sentence, "An indictment is merely an allegation and all defendants are presumed innocent until proven guilty beyond a reasonable doubt in a court of law." as a string.
    </ul>
</div>

<details>
<summary>Sample answer code</summary> 
<code style="background-color: white">
s = "An indictment is merely an allegation and all defendants are presumed innocent until proven guilty beyond a reasonable doubt in a court of law."
print(s)
</code>
</details>




<div class="alert alert-info">
<h3>Warm Up</h3>
    <ul>
    <li> Count the number of characters in your string. 
    </ul>
</div>

<details>
<summary>Sample answer code</summary> 
<code style="background-color: white">
char_count = len(s)
print(char_count)

</code>
</details>




<div class="alert alert-info">
<h3>Warm Up</h3>
    <ul>
    <li> Count the number of words in your string.
    </ul>
</div>

<details>
<summary>Sample answer code</summary> 
<code style="background-color: white">
word_count = len(s.split())
print(word_count)

</code>
</details>




<div class="alert alert-info">
<h3>Warm Up</h3>
    <ul>
    <li> Bonus: Compute the average word length.
    </ul>
</div>

<details>
<summary>Sample answer code</summary> 

<code style="background-color: white">
letter_count = len(s.strip('.').replace(' ',''))
print(letter_count/word_count)
</code>
</details>
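The sample answer above removes only the trailing period, so a sentence with commas or other punctuation would have those characters counted as letters. A minimal sketch of a more general version follows, assuming that only alphabetic characters should count toward word length. The helper name `average_word_length` is hypothetical and not part of the workshop materials.

```python
def average_word_length(text):
    """Return the mean number of alphabetic characters per whitespace-separated word."""
    words = text.split()
    if not words:
        # Avoid dividing by zero for an empty string.
        return 0.0
    # Count only letters, ignoring spaces, digits, and punctuation.
    letter_count = sum(ch.isalpha() for ch in text)
    return letter_count / len(words)


s = ("An indictment is merely an allegation and all defendants are presumed "
     "innocent until proven guilty beyond a reasonable doubt in a court of law.")
print(average_word_length(s))
```

For the warm-up sentence, which contains no punctuation other than the final period, this should agree with the sample answer.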




## Trolls

For this warm-up exercise, you will analyze a sample of 50,000 Russian troll tweets. You can answer each question in the cell space provided in this notebook or start a new notebook.

As part of its analysis of Russian influence in the 2016 U.S. presidential election, FiveThirtyEight released almost [three million Russian troll tweets](https://fivethirtyeight.com/features/why-were-sharing-3-million-russian-troll-tweets/). FiveThirtyEight obtained the data from Clemson University researchers [Darren Linvill](https://www.clemson.edu/cbshs/faculty-staff/profiles/darrenl) and [Patrick Warren](http://pwarren.people.clemson.edu/). Tweets were collected from accounts identified in the [November 2017](https://democrats-intelligence.house.gov/uploadedfiles/exhibit_b.pdf) and [June 2018](https://democrats-intelligence.house.gov/uploadedfiles/ira_handles_june_2018.pdf) lists of Internet Research Agency-connected handles [provided to Congress](https://democrats-intelligence.house.gov/news/documentsingle.aspx?DocumentID=396) by Twitter. See the FiveThirtyEight repository for a complete description of the data.


Variables in the file include:

Name | Definition
---|---------
`external_author_id` | An author account ID from Twitter 
`author` | The handle sending the tweet
`content` | The text of the tweet
`region` | A region classification, as [determined by Social Studio](https://help.salesforce.com/articleView?id=000199367&type=1)
`language` | The language of the tweet
`publish_date` | The date and time the tweet was sent
`harvested_date` | The date and time the tweet was collected by Social Studio
`following` | The number of accounts the handle was following at the time of the tweet
`followers` | The number of followers the handle had at the time of the tweet
`updates` | The number of “update actions” on the account that authored the tweet, including tweets, retweets and likes
`post_type` | Indicates if the tweet was a retweet or a quote-tweet
`account_type` | Specific account theme, as coded by Linvill and Warren
`retweet` | A binary indicator of whether or not the tweet is a retweet
`account_category` | General account theme, as coded by Linvill and Warren
`new_june_2018` | A binary indicator of whether the handle was newly listed in June 2018
`alt_external_id` | Reconstruction of author account ID from Twitter, derived from `article_url` variable and the first list provided to Congress
`tweet_id` | Unique ID assigned by Twitter to each status update, derived from `article_url`
`article_url` | Link to original tweet. Now redirects to "Account Suspended" page
`tco1_step1` | First redirect for the first http(s)://t.co/ link in a tweet, if it exists
`tco2_step1` | First redirect for the second http(s)://t.co/ link in a tweet, if it exists
`tco3_step1` | First redirect for the third http(s)://t.co/ link in a tweet, if it exists


<div class="alert alert-info">
<h3>Load the dataset as a pandas dataframe</h3>
You can access the data directly from <a href="https://raw.githubusercontent.com/nealcaren/UiOBigData/master/notebooks/data/ira_sample_50k.csv">the web.</a> <p>
URL: https://raw.githubusercontent.com/nealcaren/UiOBigData/master/notebooks/data/ira_sample_50k.csv


Hint:
* Remember to `import` pandas. 
* Store the URL as a string.
* Pandas can load a file from a local directory or from a URL.
</div>

<details>
<summary>Sample answer code</summary> 
<code style="background-color: white">
import pandas as pd
    
troll_url = 'https://raw.githubusercontent.com/nealcaren/UiOBigData/master/notebooks/data/ira_sample_50k.csv'
troll_df = pd.read_csv(troll_url)
</code>
</details>
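Because `pd.read_csv` accepts a local path, a URL, or a file-like object, the same call can be tried without a network connection. The sketch below assumes a tiny made-up CSV that borrows two column names from the table above. The rows are illustrative placeholders, not records from the dataset.

```python
import io

import pandas as pd

# A made-up, two-row CSV held in memory; the values are placeholders.
toy_csv = io.StringIO(
    "author,content\n"
    "handle_a,first example tweet\n"
    "handle_b,second example tweet with more words\n"
)

# read_csv treats the in-memory buffer just like a file or URL.
toy_df = pd.read_csv(toy_csv)
print(toy_df.shape)
```

Swapping `toy_csv` for the URL string should load the real sample in the same way.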



<div class="alert alert-info">
<h3>Inspect the dataframe</h3>

Hint:
* `describe`, `info`, and `sample` are useful.
* You will want to insert additional code cells for each command.
</div>

<details>
<summary>Sample answer code</summary> 
<code style="background-color: white">
troll_df.info()
troll_df.describe()
troll_df.sample(10)
</code>
  <p> <p> Note: Make sure to replace <code style="background-color: white">troll_df</code> with the name you used, if you picked something else.
</details>

<div class="alert alert-info">
<h3>How many columns are in the dataset?</h3>


Hint: 
* `len` can be used to compute the length of a list.
</div>



<details>
<summary>Sample answer code</summary> 
<code style="background-color: white">
len(troll_df.keys())
</code>
</details>
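Two other common ways to get the column count are the `columns` attribute and the `shape` tuple, which reports rows and columns together. A minimal sketch on a made-up dataframe (the values are placeholders, not data from the troll sample):

```python
import pandas as pd

# Toy dataframe with illustrative values only.
toy_df = pd.DataFrame({
    "author": ["handle_a", "handle_b", "handle_c"],
    "following": [10, 250, 40],
})

# shape is a (rows, columns) tuple.
n_rows, n_cols = toy_df.shape
print(n_rows, n_cols)

# columns is list-like, so len works on it as well.
print(len(toy_df.columns))
```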


<div class="alert alert-info">
    <h3> How many tweets had links?</h3>

</div>



<details>
<summary>Sample answer code</summary> 
<code style="background-color: white">
troll_df['tco1_step1'].describe()
</code>
</details>
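The `count` row of `describe` reports the number of non-missing values. A more direct sketch is below, assuming that a tweet "had a link" whenever `tco1_step1` is not missing. The toy rows and example.com URLs are placeholders.

```python
import pandas as pd

# Toy dataframe: the second tweet has no link, so its value is missing.
toy_df = pd.DataFrame({
    "content": ["tweet with link", "tweet without link", "another link"],
    "tco1_step1": ["https://example.com/a", None, "https://example.com/b"],
})

# notna() gives True for rows with a link; summing counts the True values.
has_link = toy_df["tco1_step1"].notna()
link_count = int(has_link.sum())
print(link_count)

# count() skips missing values and gives the same number.
print(toy_df["tco1_step1"].count())
```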


<div class="alert alert-info">
<h3>Bonus: How many of the tweets are tagged RightTroll?</h3>

</div>

<details>
<summary>Sample answer code</summary> 
<code style="background-color: white">
troll_df['account_category'].value_counts()
</code>
<p>You can also access the number directly:
    <code style="background-color: white">
troll_df['account_category'].value_counts()['RightTroll']
        </code>
</details>
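Another way to count one category is to build a boolean mask and sum it, and `value_counts(normalize=True)` gives shares instead of counts. A minimal sketch on made-up data, where `OtherCategory` is a placeholder label and not a value taken from the dataset:

```python
import pandas as pd

# Toy dataframe; "OtherCategory" is a placeholder label.
toy_df = pd.DataFrame({
    "account_category": ["RightTroll", "OtherCategory", "RightTroll", "OtherCategory"],
})

# Comparing a column to a value yields True/False; True counts as 1 when summed.
is_right = toy_df["account_category"] == "RightTroll"
right_count = int(is_right.sum())
print(right_count)

# normalize=True returns proportions rather than raw counts.
right_share = toy_df["account_category"].value_counts(normalize=True)["RightTroll"]
print(right_share)
```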

<div class="alert alert-info">
<h3>Bonus: What is the median number of following accounts?</h3>

</div>

<details>
<summary>Sample answer code</summary> 
<code style="background-color: white">
troll_df['following'].describe()
</code>
</details>
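In the `describe` output the median appears as the `50%` row. It can also be requested on its own with `median`. A minimal sketch on made-up numbers (not values from the dataset):

```python
import pandas as pd

# Toy values chosen so the middle value is easy to see.
toy_df = pd.DataFrame({"following": [10, 250, 40, 1000, 75]})

# median() returns the middle value of the sorted column.
median_following = toy_df["following"].median()
print(median_following)

# The same number is the "50%" entry of describe().
print(toy_df["following"].describe()["50%"])
```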



<div class="alert alert-info">
<h3>New Variable</h3>




<p> Create a new variable <code>tweet_length</code> that is the number of words in the tweet.

Hint:
* This requires you to `apply` a function that you create.
</div>



<details>
<summary>Sample answer code</summary> 
Define a new function that counts the number of words in a string: <p>
<code style="background-color: white">
def word_counter(content):
    '''Count the number of words in a string.'''
    content_words = content.split()
    word_count = len(content_words)
    return word_count
</code>
<p>
Apply the function on the tweet text variable to create a new column.

<code style="background-color: white">


troll_df['tweet_length'] = troll_df['content'].apply(word_counter)
</code>
</details>
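If any `content` value is missing, calling `split` on it raises an error, because missing values are not strings. The sketch below is a variant of `word_counter` that treats non-string values as zero words. That convention is an assumption made for illustration, and the toy rows are placeholders.

```python
import pandas as pd


def word_counter(content):
    '''Count the number of words in a string; non-strings count as zero.'''
    if not isinstance(content, str):
        # Missing values (None or NaN) are not strings.
        return 0
    return len(content.split())


# Toy dataframe with one missing tweet text.
toy_df = pd.DataFrame({"content": ["one two three", "single", None]})

# apply runs the function once per row of the column.
toy_df["tweet_length"] = toy_df["content"].apply(word_counter)
print(toy_df)
```

Whether missing text should count as zero words or stay missing depends on the analysis.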
