# Data Preprocessing and Exploratory Data Analysis

## Data Preprocessing
Now that we've scraped all the data, we can take a look at it before performing any analysis.
To make things a bit more interesting, let's load the `data_large.csv` file. This file is very similar to the output from the `scraping` notebook, but the search was performed over a distance of 9999 (the US is ~3000 miles across).
We can easily load CSV files into a `pandas.DataFrame` with the `read_csv` function!

In [13]:
import numpy as np
import pandas as pd

In [2]:
df_large = pd.read_csv("./../assets/data_large.csv")

Now that the dataframe is loaded, let's check out the size (rows, columns) of the data. This is done by checking the `shape` attribute. The returned output is a tuple identifying the number of rows and columns.

In [3]:
print(df_large.shape)

(38912, 5)


38,912 entries with 5 features! Not too shabby. Let's see what a few of the entries look like. There are a few ways to do this, and the common methods are `head()`, `tail()`, and `sample()`. The first argument of each is the number of entries to return. By default, `head()` and `tail()` return 5 entries, and `sample()` returns a single one.
* Personally, I prefer using `sample()` for most situations. The former two are great for time series data, however, or anything else that has an order to it.
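To make the defaults concrete, here is a minimal sketch on a made-up toy frame (the `toy` data and its column names are assumptions for illustration, not part of the scraped set). The `random_state` argument just makes the random draw repeatable.

```python
import pandas as pd

# Toy frame standing in for the scraped data (values are made up).
toy = pd.DataFrame({
    "city": ["A", "B", "C", "D", "E", "F", "G"],
    "n_posts": [3, 1, 4, 1, 5, 9, 2],
})

first_five = toy.head()                       # default n=5: the first five rows
last_two = toy.tail(2)                        # the last two rows
one_random = toy.sample()                     # default n=1: a single random row
three_random = toy.sample(3, random_state=0)  # three random rows, repeatable draw

print(first_five.shape, last_two.shape, one_random.shape, three_random.shape)
```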

In [4]:
df_large.sample(8)  # recall that the 8 means number of entries to return

Unnamed: 0.1,Unnamed: 0,city,searched_zipcode,url,content
23689,23689,SanDiego,92154,https://losangeles.craigslist.org/wst/res/d/st...,I'm currently studying twice a week. I have a ...
21862,21862,Nashville,37211,https://memphis.craigslist.org/res/d/flooring-...,Looking for inside/outside sales position in f...
9517,9517,NewYork,10025,https://newyork.craigslist.org/lgi/res/d/part-...,We are a busy Learning and Technology Center l...
23534,23534,SanDiego,92154,https://losangeles.craigslist.org/wst/res/d/se...,"Hi, I am 23 years old, I am a laid back guy, w..."
10071,10071,NewYork,10025,https://southjersey.craigslist.org/res/d/looki...,I live in hammonton and currently working in w...
22623,22623,SanDiego,92154,https://losangeles.craigslist.org/sfv/res/d/lo...,"Hi my names kenny im 24 years old, i was born ..."
18242,18242,Houston,77036,https://houston.craigslist.org/res/d/professio...,I need a professional team.\n\nhttps://youtu.b...
2939,2939,Chicago,60629,https://racine.craigslist.org/res/d/interior-p...,Interior and Exterior painting. Epoxy floor co...


Looks familiar! But what is the `Unnamed: 0` column? By default, when we export the `DataFrame` as a CSV, it saves the index number as a new column, and when it is reloaded, it does not assume that the first column is the index. There are a few ways to go about this:
* When exporting with `DataFrame.to_csv()` use the argument `index=False` to prevent the index column being saved.
* When importing with `pandas.read_csv()` use the argument `index_col=0` to specify that the first (or other column) is the index.
* Drop the column after importing the CSV.

Since the data is already loaded, let's use the last method of dropping the column.

In [5]:
df_large.drop(['Unnamed: 0'], axis=1, inplace=True)  # inplace modifies the variable directly, equivalent to df_large = df_large.drop(...)

df_large.sample()

Unnamed: 0,city,searched_zipcode,url,content
6428,LosAngeles,90044,https://losangeles.craigslist.org/ant/res/d/lo...,I have mig and stick experience
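For reference, the first two options from the list can be sketched with an in-memory round trip. This is only an illustration on a made-up two-row frame; `io.StringIO` stands in for a file on disk.

```python
import io

import pandas as pd

# Made-up frame used only to demonstrate the CSV round trip.
toy = pd.DataFrame({"city": ["Chicago", "Houston"], "content": ["a", "b"]})

# Default export writes the index as an unlabeled first column...
csv_with_index = toy.to_csv()
# ...which read_csv then loads as a regular column named "Unnamed: 0".
reloaded_default = pd.read_csv(io.StringIO(csv_with_index))

# Option 1: skip the index when exporting.
reloaded_no_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

# Option 2: tell read_csv that the first column is the index.
reloaded_index_col = pd.read_csv(io.StringIO(csv_with_index), index_col=0)

print(list(reloaded_default.columns))
print(list(reloaded_no_index.columns))
print(list(reloaded_index_col.columns))
```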


Great! Now, another way to slim down the dataset is to check for duplicates. Remember how this larger set has its search distance at effectively the entire US? Well, as a result, there's a pretty good chance we're going to get duplicate entries. There are multiple ways to check for this condition:
* Use the `DataFrame.duplicated()` method, which returns a boolean series marking which rows are duplicates.
* Use the `DataFrame.nunique()` method to figure out how many unique values there are in each column.

The issue with the first method is that it checks whether entire rows are identical. Since duplicates may come from different zipcode searches, it may not return the duplicates we're interested in (content or url). A simple way to count the duplicated rows found with this approach is to `sum()` the result. This works because `duplicated()` returns a boolean value per row, where False = 0 and True = 1.

Let's run both methods!

In [6]:
print('Sum of duplicated: ', df_large.duplicated().sum())
print('\nnunique results:\n', df_large.nunique())

Sum of duplicated:  20

nunique results:
 city                   18
searched_zipcode       29
url                 13967
content             12933
dtype: int64


Ah! As we expected, `duplicated()` didn't do what we'd like: it only flagged the 20 rows that are identical in every column. The method does accept arguments, however, that let it do what we're looking for. You can pass a list of columns to check, e.g. `df_large.duplicated(['content'])`, which returns a boolean series.

Nonetheless, `nunique()` told us what we needed: there are 12,933 unique resumes. Let's now use the `duplicated()` method to keep just the unique resumes.
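The difference between the whole-row check and the column-restricted check is easy to see on a tiny made-up frame (the values below are invented for illustration):

```python
import pandas as pd

# Same resume text found through different zipcode searches (made-up data).
toy = pd.DataFrame({
    "searched_zipcode": [10025, 92154, 92154, 60629],
    "content": ["welder", "welder", "welder", "painter"],
})

# Whole-row comparison: only row 2 repeats an earlier row exactly.
full_row_dupes = toy.duplicated().sum()

# Compare on 'content' only: rows 1 and 2 both repeat the text of row 0.
content_dupes = toy.duplicated(["content"]).sum()

unique_contents = toy["content"].nunique()

print(full_row_dupes, content_dupes, unique_contents)
```

Note that the number of rows minus the `content` duplicates equals the number of unique `content` values, which is the relationship the notebook relies on.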

In [7]:
df_large = df_large[~df_large.duplicated(['content'])]

print("New dataframe size: ", df_large.shape)

New dataframe size:  (12933, 4)


Note that the new dataframe size matches the number of unique `content` values found in the cell above.

The line above may look a little confusing, so let's break it down.

1. `df_large.duplicated(['content'])` flags every row whose `content` has already appeared earlier. By default the first instance is left unflagged, so it is the one that gets kept. The result is a boolean series with True/False for each entry.
2. The tilde (\~) denotes negation, which simply flips all True/False values. The result marks all the entries that are not duplicates.
3. `df_large[~df_large.duplicated(['content'])]` is called boolean indexing: it returns the rows of `df_large` where the entries in the brackets are True, in this case the non-duplicated rows.
4. We reassign `df_large` to this new "not duplicated" version.
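The steps above can be traced one at a time on a small made-up frame. The sketch also shows `DataFrame.drop_duplicates()`, which is a built-in shortcut for the same operation (the notebook itself uses the mask approach).

```python
import pandas as pd

# Made-up data: the first two rows share the same resume text.
toy = pd.DataFrame({
    "city": ["NewYork", "SanDiego", "Chicago"],
    "content": ["welder", "welder", "painter"],
})

is_repeat = toy.duplicated(["content"])  # step 1: False, True, False
keep_mask = ~is_repeat                   # step 2: True, False, True
deduped = toy[keep_mask]                 # step 3: rows 0 and 2 survive

# Shortcut that keeps the first occurrence of each 'content' value.
shortcut = toy.drop_duplicates(subset=["content"])

print(deduped)
print(deduped.equals(shortcut))
```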

Sweet! Now the data is a little cleaner, and we've reduced the size to just under a third of the original (12,933 of 38,912 rows)!

Some other useful sanity checks include: 
* `DataFrame.describe()` which displays measures of central tendency and some other metrics on all applicable columns (numerical).
* `DataFrame.dtypes` is an attribute that keeps track of all the data types used. Useful for larger sets.
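As a quick illustration of both checks, here is a sketch on a made-up frame. One thing it highlights: zipcodes are stored as integers, so `describe()` will happily compute a mean for them even though that number is not meaningful.

```python
import pandas as pd

# Made-up frame with one text column and one integer column.
toy = pd.DataFrame({
    "city": ["Chicago", "Houston", "Houston"],
    "searched_zipcode": [60629, 77036, 77084],
})

summary = toy.describe()  # by default, only numeric columns are summarized
column_types = toy.dtypes  # one dtype per column

print(summary)
print(column_types)
```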

Some of you may have noticed that the city in the URL may not match the `city` value. This may cause issues later on. A simple fix is to use the city from the URL as opposed to the one used for the search, especially since we dropped duplicates on the content feature while retaining only the first encountered entry. We'll fix this with the lines below.

In [9]:
sr_split = df_large['url'].str.split("//", expand=True)[1]
sr_split = sr_split.str.split(".", expand=True)[0]
df_large['city'] = sr_split

df_large.sample(5)

Unnamed: 0,city,searched_zipcode,url,content
4537,killeen,77084,https://killeen.craigslist.org/res/d/honestdep...,I do not understand I am 56 a hard worker look...
19893,modesto,93722,https://modesto.craigslist.org/res/d/experienc...,I am a freelance editor/proofreader located in...
8078,newyork,10025,https://newyork.craigslist.org/wch/res/d/hirin...,Would you like to work with a friendly and res...
10294,southjersey,10025,https://southjersey.craigslist.org/res/d/perso...,Willing to work weekends\nFlexible during the ...
4375,sanantonio,77084,https://sanantonio.craigslist.org/res/d/lookin...,Im a very responsible and honest young man tha...


Woah, what the heck? Sorry, the above is not very Pythonic, but it is more efficient than looping through each row. Let's break it down.

1. `pandas.Series.str.split()` acts similarly to the `split()` method on strings but is performed on an entire `pandas.Series` instead; it breaks up the strings at the separator given as the first argument. The `expand` argument makes each piece a new column as opposed to a single column holding a list, which is the default.
2. Since the value returned by the split is a `pandas.DataFrame`, the column names represent the split index (again, similar to a string split). When we split by `//`, we separate the `https:` portion from the rest of the string, and we keep the rest by taking the column at index 1.
3. We do the same thing, but this time split at the period, and return the first split value, which is the city via url.
4. Since we're taking a single column from the `pandas.DataFrame`, the variable will default to a series, which we can use to replace the `city` column in our original `DataFrame`.
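Here is the same two-step split traced on a couple of URLs. The URL paths are invented for illustration. The regex alternative with `str.extract()` at the end is not what the notebook does; it is just another way to pull out the same piece, assuming every URL has the form `scheme://subdomain.rest`.

```python
import pandas as pd

# Hypothetical URLs in the same shape as the scraped ones.
urls = pd.Series([
    "https://killeen.craigslist.org/res/d/example-one",
    "https://newyork.craigslist.org/wch/res/d/example-two",
])

# Steps 1-2: split on "//" and keep column 1 (everything after the scheme).
after_scheme = urls.str.split("//", expand=True)[1]

# Step 3: split on "." and keep column 0 (the city subdomain).
subdomain = after_scheme.str.split(".", expand=True)[0]

# Alternative sketch: a single regex capture between "//" and the first ".".
via_regex = urls.str.extract(r"//([^.]+)\.", expand=False)

print(subdomain.tolist())
print(via_regex.tolist())
```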

Let's see how many cities we now have!

In [10]:
df_large.nunique()

city                  193
searched_zipcode       26
url                 12933
content             12933
dtype: int64

Makes a lot more sense than the 18 we previously had! 

At this point, the `searched_zipcode` is probably useless and may actually be a bit confusing since it may not match up to the city. Let's drop it and continue our analysis.

In [11]:
df_large.drop(['searched_zipcode'], axis=1, inplace=True)

In [68]:
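def get_education(driver, url):
    """Return the education level listed on a posting, or NaN if none is found."""
    driver.get(url)
    try:
        # Newer Selenium releases use driver.find_elements(By.XPATH, ...) instead
        paths = driver.find_elements_by_xpath('/html/body/section/section/section/div[1]/p/span')
        education = [i.text for i in paths if 'education' in i.text]  # sometimes the last item is a license
        if education:  # at least one span mentions education
            return education[0].split(':')[-1].strip()
        return np.nan
    except Exception:  # page failed to load or had an unexpected layout
        return np.nan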

In [62]:
from selenium import webdriver
driver = webdriver.Chrome("./../assets/chromedriver")

In [None]:
for i, v in df_large.iterrows():
    df_large.loc[i, 'education'] = get_education(driver, v['url'])

Alright! Enough of this cleaning stuff, let's get to some analysis!

## Analysis

There are a _ton_ of visualization packages available in Python (though they're not as pretty and powerful as some R ones), but by far the most commonly used is matplotlib, a NumFOCUS-sponsored project. We'll be exploring some plots in both matplotlib and seaborn, a higher-level library built on top of it that makes pretty plots easier.