# Data Collection and Cleaning



### First Step:
Get the HTML files that Facebook gave me and scrape them. Ideally, I want to access every post from my timeline and every message from my messages file so that I can categorize them by the time they were created.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import datetime as dt
from datetime import datetime


In [2]:
# Load the timeline HTML file that I received from Facebook
with open("timeline.htm") as f:
    req = f.read()

In [3]:
soup = BeautifulSoup(req, "html.parser")

In [4]:
dates = soup.find_all("div", {"class": "meta"})
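
To make the lookup above easier to follow, here is a minimal sketch run on a tiny hand-written snippet instead of the real export. The `sample_html` string and the names `sample_soup` and `sample_dates` are made up for illustration, and the layout (a `meta` div followed by a `comment` div) is an assumption based on the classes used in this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the export; the real file is much larger.
sample_html = """
<div class="meta">Wednesday, May 17, 2017 at 1:33pm PDT</div>
<div class="comment">First example post</div>
<div class="meta">Tuesday, May 16, 2017 at 10:48am PDT</div>
<div class="comment">Second example post</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Same call as in the cell above: every <div> whose class is "meta".
sample_dates = [div.text for div in sample_soup.find_all("div", {"class": "meta"})]
print(sample_dates)
```

Each entry in `sample_dates` is the raw timestamp string that the later cells split apart.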

In [5]:
#scratch workbook to understand how to get the attributes that I need out of the data (you can ignore this cell)
dates[2].text.split(",")[2].strip()
month = dates[2].text.split(",")[1].strip().split(" ")[0]
day = dates[2].text.split(",")[1].strip().split(" ")[1]
month, day
time = dates[2].text.split(",")[2].strip().split(" ")[2]
time
if "pm" in time:
    time = time.rstrip("pm")
    arr = time.split(":")
    temp = int(arr[0])
    temp += 12
    arr[0] = str(temp)
    time = ":".join(arr)
time, temp, arr
dates[2]
dates[2].find_next("div")

<div class="meta">Wednesday, May 17, 2017 at 1:33pm PDT</div>
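
As a possible alternative to splitting the string by hand, the timestamp format shown above can probably be handled by `datetime.strptime`. This is only a sketch: `parse_meta` is a hypothetical helper, and it assumes every timestamp ends with a single time-zone abbreviation such as `PDT`, which is dropped instead of being converted.

```python
from datetime import datetime

def parse_meta(text):
    """Parse a string like 'Wednesday, May 17, 2017 at 1:33pm PDT' into a naive datetime."""
    # Drop the trailing time-zone abbreviation (assumed to be the last word),
    # since %Z does not reliably parse names like "PDT".
    without_tz = text.rsplit(" ", 1)[0]
    # %I with %p reads a 12-hour clock, so 12:xxpm stays at hour 12 and 12:xxam becomes hour 0.
    return datetime.strptime(without_tz, "%A, %B %d, %Y at %I:%M%p")

stamp = parse_meta("Wednesday, May 17, 2017 at 1:33pm PDT")
print(stamp.year, stamp.strftime("%B"), stamp.day, stamp.hour, stamp.minute)
```

The resulting object exposes `year`, `month`, `day`, `hour`, and `minute` directly, with the hour already on a 24-hour clock.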

### Next Step:
Now that I understand how my data works, I can create my dataframe. Ideally, for my timeline I want the text from every post that I created, along with the year, month, day, and time that correspond to that post. Then I want to do the same for every message that I've ever sent. One interesting thing I decided to do was to exclude anything from October 13th from both my messages and my timeline. I did this because October 13th is my birthday, and I know that my timeline and messages were almost all incredibly positive on that day, which could have skewed my data.
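
The birthday exclusion could also be applied after the dataframe is built. Here is a minimal sketch with made-up rows, assuming `Month` holds the month name and `Day` holds an integer; `posts`, `is_birthday`, and `filtered` are illustrative names only.

```python
import pandas as pd

# Made-up rows purely for illustration.
posts = pd.DataFrame({
    "Month": ["October", "October", "May"],
    "Day": [13, 14, 13],
    "Text": ["birthday post", "day after", "unrelated"],
})

# A row is dropped only when both the month and the day match.
is_birthday = (posts["Month"] == "October") & (posts["Day"] == 13)
filtered = posts[~is_birthday].reset_index(drop=True)
print(filtered)
```

Only the first row is removed; the October 14 and May 13 rows stay.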

In [6]:
timeline = pd.DataFrame(columns=["Year", "Month", "Day", "Time", "Text", "Hour", "Minutes"])
timeline

Unnamed: 0,Year,Month,Day,Time,Text,Hour,Minutes


In [7]:
Years = []
Months = []
Days = []
Texts = []
Times = []
Hours = []
Minutes = []
for comment in soup.find_all("div", {"class": "comment"}):
    date = comment.find_previous("div", {"class": "meta"})
    month = date.text.split(",")[1].strip().split(" ")[0]
    day = int(date.text.split(",")[1].strip().split(" ")[1])
    year = int(date.text.split(",")[2].strip().split(" ")[0])
    time = date.text.split(",")[2].strip().split(" ")[2]
    if "pm" in time:
        time = time.rstrip("pm")
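        arr = time.split(":")
        # Hour and Minutes keep the 12-hour clock reading from the page
        hour = int(arr[0])
        minute = int(arr[1])
        # Time is moved to a 24-hour clock by adding 12 to the hour,
        # so a 12:xxpm post is stored as 24:xx
        arr[0] = str(hour + 12)
        time = ":".join(arr)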
    elif "am" in time:
        time = time.rstrip("am")
        arr = time.split(":")
        hour = int(arr[0])
        minute = int(arr[1])
    # day is an int here, so the birthday check compares month and day separately
    if not (month == "October" and day == 13):
        Years.append(year)
        Months.append(month)
        Days.append(day)
        Texts.append(comment.text)
        Times.append(time)
        Hours.append(hour)
        Minutes.append(minute)

In [8]:
timeline["Year"] = Years
timeline["Month"] = Months
timeline["Day"] = Days
timeline["Text"] = Texts
timeline["Time"] = Times
timeline["Hour"] = Hours
timeline["Minutes"] = Minutes
timeline.head()

Unnamed: 0,Year,Month,Day,Time,Text,Hour,Minutes
0,2017,May,16,10:48,If anyone wants to see some awesome kids hitti...,10,48
1,2017,May,7,5:39,Just wanted to share that today at 12:45pm Eas...,5,39
2,2017,May,4,14:13,So I took a photo of a dog in my chem lab and ...,2,13
3,2017,April,30,23:21,Can't wait for this weekend! Let's go Poly Pol...,11,21
4,2017,April,14,12:11,"Margarita is ready for tomorrow, are you?",12,11
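
A different way to assemble the same table is to collect one dict per post and let pandas build the columns, which avoids keeping seven parallel lists in sync. This is just a sketch with invented values; `raw_posts`, `rows`, and `frame` are hypothetical names and are not part of the notebook.

```python
import pandas as pd

# Invented (year, month, day, 24-hour time, text) tuples standing in for scraped posts.
raw_posts = [
    (2017, "May", 16, "10:48", "example post a"),
    (2017, "May", 4, "14:13", "example post b"),
]

rows = []
for year, month, day, time, text in raw_posts:
    hour, minute = (int(part) for part in time.split(":"))
    rows.append({"Year": year, "Month": month, "Day": day, "Time": time,
                 "Text": text, "Hour": hour, "Minutes": minute})

# Passing columns= fixes the column order regardless of dict key order.
frame = pd.DataFrame(rows, columns=["Year", "Month", "Day", "Time", "Text", "Hour", "Minutes"])
print(frame)
```

In this sketch `Hour` is taken from the 24-hour `Time` string, so the two columns agree.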


### Next Step:
Now I have all of the data from my timeline posts, so I'll need to do the same for my messages. The process is similar, but this HTML file is laid out differently, so to scrape it I needed to specify different classes and tags to get the text and time information that I wanted.
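
To illustrate the different layout, here is a small sketch that pairs each message with its timestamp. The snippet is hand-written: the `message` wrapper div and the message texts are assumptions for illustration, and only the `span` with class `meta` and the `p` tags come from the selectors used in this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the messages file.
sample_html = """
<div class="message"><span class="meta">Wednesday, August 21, 2013 at 1:25pm PDT</span></div>
<p>first example message</p>
<div class="message"><span class="meta">Wednesday, August 21, 2013 at 1:26pm PDT</span></div>
<p>second example message</p>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# For each message body, look backwards in the document for the nearest timestamp span.
pairs = [
    (p.find_previous("span", {"class": "meta"}).text, p.text)
    for p in sample_soup.find_all("p")
]
for when, text in pairs:
    print(when, "|", text)
```

`find_previous` searches backwards from each `p`, so every message picks up the timestamp that appears just before it.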

In [9]:
with open("messages.htm") as f:
    req = f.read()

In [10]:
soup = BeautifulSoup(req, "html.parser")

In [11]:
dates = soup.find_all("span", {"class": "meta"})

In [12]:
messages = pd.DataFrame(columns=["Year", "Month", "Day", "Time", "Text"])
messages

Unnamed: 0,Year,Month,Day,Time,Text


In [13]:
Years = []
Months = []
Days = []
Texts = []
Times = []
Hours = []
Minutes = []
for comment in soup.find_all("p"):
    date = comment.find_previous("span", {"class": "meta"})
    month = date.text.split(",")[1].strip().split(" ")[0]
    day = int(date.text.split(",")[1].strip().split(" ")[1])
    year = int(date.text.split(",")[2].strip().split(" ")[0])
    time = date.text.split(",")[2].strip().split(" ")[2]
    if "pm" in time:
        time = time.rstrip("pm")
        arr = time.split(":")
        hour = int(arr[0])
        minute = int(arr[1])
        temp = int(arr[0])
        temp += 12
        arr[0] = str(temp)
        time = ":".join(arr)
    elif "am" in time:
        time = time.rstrip("am")
        arr = time.split(":")
        hour = int(arr[0])
        minute = int(arr[1])
    # day is an int here, so the birthday check compares month and day separately
    if not (month == "October" and day == 13):
        Years.append(year)
        Months.append(month)
        Days.append(day)
        Texts.append(comment.text)
        Times.append(time)
        Hours.append(hour)
        Minutes.append(minute)

In [14]:
messages["Year"] = Years
messages["Month"] = Months
messages["Day"] = Days
messages["Text"] = Texts
messages["Time"] = Times
messages["Hour"] = Hours
messages["Minutes"] = Minutes
messages.head()

Unnamed: 0,Year,Month,Day,Time,Text,Hour,Minutes
0,2015,May,11,24:28,smithygirl@gmail.com :) Good luck Sheridan! Ha...,12,28
1,2013,August,21,17:00,Haha ya ill convert you next!!!,5,0
2,2013,August,21,13:28,haha yeah I guess so! I saw that you took Loga...,1,28
3,2013,August,21,13:26,I know who that is... I just don't know him pe...,1,26
4,2013,August,21,13:25,Hold on its loading,1,25


### Final Step:
Now I have all of the text data for my timeline posts and my messages, with the year, month, day, hour, and minutes that correspond to each post or message! I'll need to access this in my other notebooks, so I used pickle to save it to disk.

In [15]:
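import pickle
# Context managers make sure each file is flushed and closed after writing
with open("messages.pkl", "wb") as f:
    pickle.dump(messages, f)
with open("timeline.pkl", "wb") as f:
    pickle.dump(timeline, f)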
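
For reference, reading a pickled dataframe back in another notebook might look like the sketch below. It uses a throwaway dataframe and a temporary directory so that it does not touch the real files; `demo`, `path`, and `restored` are illustrative names.

```python
import os
import pickle
import tempfile

import pandas as pd

# Throwaway data standing in for the real timeline or messages dataframe.
demo = pd.DataFrame({"Year": [2017], "Month": ["May"], "Day": [16], "Text": ["example"]})

path = os.path.join(tempfile.mkdtemp(), "demo.pkl")
with open(path, "wb") as f:
    pickle.dump(demo, f)

# pandas can read a dataframe that was written with pickle.dump.
restored = pd.read_pickle(path)
print(restored.equals(demo))
```

Pickle files should only be loaded when they come from a source you trust, since unpickling can execute arbitrary code.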