# Final Project - Data Acquisition Script
### Acquiring data in fulfillment of the final project for ILS Z-639, "Social Media Mining"
The following script runs three times a day for seven consecutive days. Each run gathers up to 1,400 Twitter Statuses (100 for each of the 14 Big Ten schools) and appends them to a dataset. To accomplish this, the script performs the following tasks:

1. Access the Twitter API.
2. Create data storage and execute search for Twitter Statuses.
    * Execute a search on the school's name.
    * Use the cursor object to page through the search results, letting the API object wait out rate limits, and accumulate each Twitter Status in a temporary dictionary.
    * Append the Twitter Statuses gathered this way to the accumulator dataframe.
3. Write/append the accumulator dataframe to file.
4. Check the state of the dataset, and overwrite the backup copy each time the script is executed.

Once the dataset has been accumulated, other scripts can access it for cleaning, preparation, and analysis.
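
As a rough check on the schedule described above, the expected size of the finished dataset can be worked out from the numbers used in this notebook (14 search terms, a cap of 100 statuses per term, three runs a day, seven days). The variable names are only illustrative, and the result is an upper bound: a search may return fewer than 100 statuses, and nothing here accounts for statuses collected more than once.

```python
# Collection schedule, using the figures from this notebook.
SCHOOLS = 14               # one search term per Big Ten school
STATUSES_PER_SCHOOL = 100  # cap passed to the cursor for each school
RUNS_PER_DAY = 3
DAYS = 7

statuses_per_run = SCHOOLS * STATUSES_PER_SCHOOL
total_runs = RUNS_PER_DAY * DAYS
max_statuses = statuses_per_run * total_runs

print(statuses_per_run)  # 1400
print(total_runs)        # 21
print(max_statuses)      # 29400
```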

---

## 1. Access the Twitter API:

In [11]:
import tweepy as tp
import pandas as pd
import json

In [None]:
with open("Twitter_Keys.json", "r") as file:
    keys = json.load(file)

API_KEY = keys["API_KEY"]
API_SECRET = keys["API_SECRET"]
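
Because the script runs unattended several times a day, it may help to fail early with a clear message when the credentials file is incomplete. The sketch below is one possible way to do that; `load_keys` and `REQUIRED_FIELDS` are hypothetical names, and the demonstration uses a throwaway file with placeholder values rather than real credentials.

```python
import json
import os
import tempfile

REQUIRED_FIELDS = ("API_KEY", "API_SECRET")  # field names used in the cell above

def load_keys(path="Twitter_Keys.json"):
    """Read the credentials file and return (API_KEY, API_SECRET)."""
    with open(path, "r") as file:
        keys = json.load(file)
    # Treat absent or empty values as missing.
    missing = [name for name in REQUIRED_FIELDS if not keys.get(name)]
    if missing:
        raise KeyError("Missing credential fields: " + ", ".join(missing))
    return keys["API_KEY"], keys["API_SECRET"]

# Demonstration with a throwaway file and placeholder values.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "Twitter_Keys.json")
    with open(demo_path, "w") as file:
        json.dump({"API_KEY": "placeholder-key",
                   "API_SECRET": "placeholder-secret"}, file)
    demo_key, demo_secret = load_keys(demo_path)

print(demo_key)
```

In the notebook itself, the call would presumably be `API_KEY, API_SECRET = load_keys()`.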

In [2]:
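# wait_on_rate_limit makes Tweepy sleep until the rate-limit window resets instead of raising an error.
# Note: wait_on_rate_limit_notify exists only in Tweepy 3.x; it was removed in Tweepy 4.0.
auth = tp.AppAuthHandler(API_KEY, API_SECRET)
api = tp.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)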

---

## 2. Data Storage Creation and Search/Accumulation:

In [3]:
bigten = [
    "Indiana University",
    "Michigan State University",
    "Northwestern University",
    "Ohio State University",
    "Penn State",
    "Purdue University",
    "Rutgers University",
    "University of Maryland",
    "University of Illinois",
    "University of Iowa",
    "University of Michigan",
    "University of Minnesota",
    "University of Nebraska",
    "University of Wisconsin"
]

df = pd.DataFrame(columns=["School", "Text", "Location", "Time"])

In [4]:
df.shape

(0, 4)

In [5]:
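rows = []
for school in bigten:
    # The cursor pages through the search results; items(100) caps each school at 100 statuses.
    # (api.search is the Tweepy 3.x name; Tweepy 4.0 renamed it to api.search_tweets.)
    c = tp.Cursor(api.search, q=school, lang="en")
    for status in c.items(100):
        tempDict = {"School": school, "Text": status.text, "Location": status.geo, "Time": status.created_at}
        rows.append(tempDict)

# Build the accumulator in one step; DataFrame.append was removed in pandas 2.0.
df = pd.DataFrame(rows, columns=df.columns)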

In [6]:
print(df.shape)

(1400, 4)
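
The total of 1400 rows only confirms the overall count. A search for a given school can return fewer than 100 statuses, so a per-school tally may be a useful extra check. The following is a minimal sketch on a small synthetic frame (not real Twitter data); `count_per_school` is a hypothetical helper, not part of the original script.

```python
import pandas as pd

def count_per_school(frame, schools, expected=100):
    """Return per-school row counts and the subset that falls below `expected`."""
    # reindex keeps the order of `schools` and reports 0 for schools with no rows.
    counts = frame["School"].value_counts().reindex(schools, fill_value=0)
    short = counts[counts < expected]
    return counts, short

# Demonstration on synthetic rows.
demo = pd.DataFrame({
    "School": ["Indiana University"] * 3 + ["Purdue University"] * 2,
    "Text": ["..."] * 5,
})
counts, short = count_per_school(
    demo, ["Indiana University", "Purdue University", "Penn State"], expected=3
)

print(counts.to_dict())
print(list(short.index))
```

Against the real data, the call would presumably be `count_per_school(df, bigten)`, with any school listed in `short` worth a closer look.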


## 3. Write/Append Data to a File:

In [7]:
import pickle as pkl
import os

In [8]:
# Set up an empty dataframe for the first run, before the file exists.
projectData = pd.DataFrame(columns=["School", "Text", "Location", "Time"])

# Check if the file exists. If so, load in the current state of the data.
if os.path.exists("ProjectData.pkl"):
    # The with block closes the file on exit, so no explicit close() is needed.
    with open("ProjectData.pkl", "rb") as fileData:
        projectData = pkl.load(fileData)

# Append the df created in the cells above to either
# (1) the empty projectData dataframe, or (2) the current state of the data as it has been loaded.
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
projectData = pd.concat([projectData, df])
with open("ProjectData.pkl", "wb") as fileData:
    pkl.dump(projectData, fileData)
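
The load-or-create-then-append step above could also be wrapped in a single function. The sketch below is one way to do that, with two deliberate differences from the cell above: it uses `pd.read_pickle` and `DataFrame.to_pickle` rather than the `pickle` module directly, and it passes `ignore_index=True` so the combined frame gets a fresh 0..n-1 index. `append_to_pickle` is a hypothetical name, and the demonstration writes synthetic rows to a throwaway directory.

```python
import os
import tempfile
import pandas as pd

COLUMNS = ["School", "Text", "Location", "Time"]

def append_to_pickle(new_rows, path):
    """Append new_rows to the DataFrame pickled at path, creating the file if absent."""
    if os.path.exists(path):
        existing = pd.read_pickle(path)
        combined = pd.concat([existing, new_rows], ignore_index=True)
    else:
        combined = new_rows.reset_index(drop=True)
    combined.to_pickle(path)
    return combined

# Demonstration: two runs against the same file, one synthetic row each.
batch = pd.DataFrame(
    [["Penn State", "example text", None, "2017-09-26"]], columns=COLUMNS
)
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "ProjectData.pkl")
    first_shape = append_to_pickle(batch, target).shape
    second_shape = append_to_pickle(batch, target).shape

print(first_shape, second_shape)  # (1, 4) (2, 4)
```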

## 4. Check the Current State of the Dataset, and Create a Backup Copy:

In [9]:
# Check if the dataset is intact.
with open("ProjectData.pkl", "rb") as fileData:
    accumulated_dataset = pkl.load(fileData)

print(accumulated_dataset.shape)
print(accumulated_dataset.head(5))

# Overwrite the backup dataset with the full current dataset.
with open("BackupDataset.pkl", "wb") as backup:
    pkl.dump(accumulated_dataset, backup)

(2800, 4)
               School                                               Text  \
0  Indiana University  RT @Prospects_IN: Indiana Prospects Scout Day ...   
1  Indiana University  RT @loganmichael99: Excited to announce that I...   
2  Indiana University  RT @Hgrooo8: Very blessed and humbled to annou...   
3  Indiana University  Thoughts and prayers to our sisters at Indiana...   
4  Indiana University  RT @MPA_Vikings: @hordefb with the win over In...   

  Location                Time  
0     None 2017-09-26 02:30:08  
1     None 2017-09-26 02:27:22  
2     None 2017-09-26 02:26:24  
3     None 2017-09-26 02:17:38  
4     None 2017-09-26 02:11:10  
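
The backup cell above overwrites a single `BackupDataset.pkl` on every run, so a corrupted dataset would also replace the only backup. If one copy per run is wanted instead, a timestamped file name is one option. This is a sketch under that assumption; `backup_with_timestamp`, the `backups` directory, and the file-name pattern are all hypothetical, and the demonstration copies placeholder bytes in a throwaway directory.

```python
import os
import shutil
import tempfile
from datetime import datetime

def backup_with_timestamp(source, backup_dir, now=None):
    """Copy source into backup_dir under a name that includes a timestamp."""
    now = now or datetime.now()
    os.makedirs(backup_dir, exist_ok=True)
    stem, ext = os.path.splitext(os.path.basename(source))
    target = os.path.join(backup_dir, f"{stem}_{now:%Y%m%d_%H%M%S}{ext}")
    shutil.copy2(source, target)  # copies the file contents and metadata
    return target

# Demonstration with a fixed timestamp so the resulting name is predictable.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "ProjectData.pkl")
    with open(source, "wb") as handle:
        handle.write(b"placeholder bytes")
    copy_path = backup_with_timestamp(
        source, os.path.join(tmp, "backups"), now=datetime(2017, 9, 26, 2, 30, 8)
    )
    copy_name = os.path.basename(copy_path)
    copy_exists = os.path.exists(copy_path)

print(copy_name)  # ProjectData_20170926_023008.pkl
```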


### And... obsessively check the backup dataset to make sure it is alright:

In [10]:
# Reload the backup and confirm it has the same shape and first rows as the dataset.
with open("BackupDataset.pkl", "rb") as checkBackup:
    bup = pkl.load(checkBackup)
print(bup.shape)
print(bup.head(5))

(2800, 4)
               School                                               Text  \
0  Indiana University  RT @Prospects_IN: Indiana Prospects Scout Day ...   
1  Indiana University  RT @loganmichael99: Excited to announce that I...   
2  Indiana University  RT @Hgrooo8: Very blessed and humbled to annou...   
3  Indiana University  Thoughts and prayers to our sisters at Indiana...   
4  Indiana University  RT @MPA_Vikings: @hordefb with the win over In...   

  Location                Time  
0     None 2017-09-26 02:30:08  
1     None 2017-09-26 02:27:22  
2     None 2017-09-26 02:26:24  
3     None 2017-09-26 02:17:38  
4     None 2017-09-26 02:11:10  
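
Comparing the printed shape and first five rows by eye works, but the comparison can also be done programmatically. The sketch below uses `DataFrame.equals` to test whether two pickled frames hold the same contents; `backup_matches` is a hypothetical helper, and the demonstration uses small synthetic frames in a throwaway directory.

```python
import os
import tempfile
import pandas as pd

def backup_matches(primary_path, backup_path):
    """Return True when both pickles hold DataFrames with identical contents."""
    primary = pd.read_pickle(primary_path)
    backup = pd.read_pickle(backup_path)
    return primary.equals(backup)

# Demonstration: an identical backup, then a backup that has drifted.
frame = pd.DataFrame({"School": ["Penn State"], "Text": ["example text"]})
changed = pd.DataFrame({"School": ["Penn State"], "Text": ["different text"]})
with tempfile.TemporaryDirectory() as tmp:
    primary = os.path.join(tmp, "ProjectData.pkl")
    backup = os.path.join(tmp, "BackupDataset.pkl")
    frame.to_pickle(primary)
    frame.to_pickle(backup)
    same = backup_matches(primary, backup)
    changed.to_pickle(backup)
    differs = not backup_matches(primary, backup)

print(same, differs)  # True True
```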
