# Data Preprocessing and Exploratory Data Analysis

Now that we've scraped all the data, we can take a look at it before performing any analysis.
To make things a bit more interesting, let's load the `data_large.csv` file. This file is very similar to the output from the `scraping` notebook, but the search was performed over a distance of 9999 (the US is ~3000 miles across).
We can easily load CSV files into a `pandas.DataFrame` with the `read_csv` function!

In [1]:
import pandas as pd

In [25]:
df_large = pd.read_csv("./../assets/data_large.csv")

Now that the dataframe is loaded, let's check out the size (rows, columns) of the data. This is done by checking the `shape` attribute. The returned output is a tuple identifying the number of rows and columns.

In [13]:
print(df_large.shape)

(38912, 5)


38,912 entries with 5 features! Not too shabby. Let's see what a few of the entries look like. There are a few ways to do this; the most common methods are `head()`, `tail()`, and `sample()`. Each takes the number of entries to return as its first argument. By default, `head()` and `tail()` return 5 entries, and `sample()` returns a single one.
* Personally, I prefer using `sample()` for most situations. However, `head()` and `tail()` are great for time series data, or anything else that has an order to it.
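
To make the defaults concrete, here is a minimal sketch on a small made-up frame. The name `toy` and its contents are hypothetical and have nothing to do with the scraped data; `random_state` is an optional argument that makes `sample()` repeatable.

```python
import pandas as pd

# Hypothetical toy frame, used only to illustrate the three preview methods.
toy = pd.DataFrame({"n": range(10)})

first_rows = toy.head()     # first 5 rows by default
last_rows = toy.tail(3)     # last 3 rows
one_row = toy.sample()      # a single random row by default
repeatable = toy.sample(4, random_state=0)  # fixed seed: same 4 rows on every run

print(len(first_rows), len(last_rows), len(one_row), len(repeatable))
```

Because `sample()` picks rows at random, it tends to give a more representative glimpse of unordered data than the first or last few rows.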

In [20]:
df_large.sample(8)  # recall that the 8 means number of entries to return

Unnamed: 0.1,Unnamed: 0,city,searched_zipcode,url,content
10167,10167,NewYork,10025,https://newyork.craigslist.org/mnh/res/d/exp-h...,Experienced House Mgr. Available\nI've worked ...
20974,20974,Chicago,60618,https://grandrapids.craigslist.org/res/d/exter...,Experienced Siders\n\nExperienced roofers\n\nE...
20246,20246,Chicago,60618,https://chicago.craigslist.org/nwc/res/d/offic...,Experience cleaning lady is available for offi...
11505,11505,Dallas,75217,https://oklahomacity.craigslist.org/res/d/no-j...,"Power washing,yard cleaning, and any similar s..."
31793,31793,NewYork,10029,https://newyork.craigslist.org/mnh/res/d/itali...,"Dear Business Owners & Financial Investors, I ..."
30736,30736,Charlotte,28269,https://charlotte.craigslist.org/res/d/indepen...,"Hello, my name is Bruce, I'm a retired transpo..."
16870,16870,NewYork,10002,https://hudsonvalley.craigslist.org/res/d/need...,Can't afford to hire a bookkeeper full time or...
24672,24672,Sacramento,95823,https://sfbay.craigslist.org/sby/res/d/class-e...,"I have a Class A for 18 years, I have no point..."


Looks familiar! But what is the `Unnamed: 0` column? By default, when we export a `DataFrame` as a CSV, pandas saves the index as an extra column, and when the file is reloaded, it does not assume that the first column is the index. There are a few ways to go about this:
* When exporting with `DataFrame.to_csv()`, use the argument `index=False` to prevent the index column from being saved.
* When importing with `pandas.read_csv()`, use the argument `index_col=0` to specify that the first column (or another column, by changing the number) is the index.
* Drop the column after importing the CSV.
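
The three options above can be sketched as a small round trip. This is only an illustration: `toy` is a hypothetical two-row frame, and the CSV is kept in memory with `io.StringIO` rather than written to disk.

```python
import io
import pandas as pd

# Hypothetical toy frame standing in for the scraped data.
toy = pd.DataFrame({"city": ["Chicago", "Dallas"], "searched_zipcode": [60618, 75217]})

# Default export: the index is written as an unnamed first column.
csv_with_index = toy.to_csv()
reloaded = pd.read_csv(io.StringIO(csv_with_index))
print(list(reloaded.columns))  # ['Unnamed: 0', 'city', 'searched_zipcode']

# Option 1: do not write the index when exporting.
opt1 = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

# Option 2: tell read_csv that the first column is the index.
opt2 = pd.read_csv(io.StringIO(csv_with_index), index_col=0)

# Option 3: drop the stray column after loading.
opt3 = reloaded.drop(columns=["Unnamed: 0"])

print(list(opt1.columns), list(opt2.columns), list(opt3.columns))
```

All three end up with only the `city` and `searched_zipcode` columns.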

Since the data is already loaded, let's use the last method of dropping the column.

In [26]:
df_large.drop(['Unnamed: 0'], axis=1, inplace=True)  # inplace=True modifies df_large directly, equivalent to df_large = df_large.drop(...)

df_large.sample()

Unnamed: 0,city,searched_zipcode,url,content
6776,LosAngeles,90044,https://orangecounty.craigslist.org/res/d/prog...,CORE COMPETENCIES\n\nCxO-Level Business Strate...
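
As a quick check of the `inplace` behaviour used in the cell above, the sketch below compares it with plain reassignment on a hypothetical frame named `toy`. One detail worth knowing: with `inplace=True`, `drop()` returns `None`, so its result should not be assigned back to a variable.

```python
import pandas as pd

# Hypothetical toy frame with a stray index column.
toy = pd.DataFrame({"Unnamed: 0": [0, 1], "city": ["Chicago", "Dallas"]})

# Reassignment form: drop() returns a new DataFrame and leaves `toy` untouched.
reassigned = toy.drop(columns=["Unnamed: 0"])

# In-place form: the call returns None and `in_place` itself is modified.
in_place = toy.copy()
result = in_place.drop(["Unnamed: 0"], axis=1, inplace=True)

print(result)                       # None
print(reassigned.equals(in_place))  # True
```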


Great! Now we can check how many unique values each column contains with `nunique()`.

In [None]:
df_large.nunique()

Some other useful sanity checks include:
* `DataFrame.describe()`, which displays measures of central tendency and other summary statistics for all applicable (numerical) columns.
* `DataFrame.dtypes`, an attribute that lists the data type of each column. Useful for larger sets.

city                   18
searched_zipcode       29
url                 13967
content             12933
dtype: int64
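
The `describe()` and `dtypes` checks mentioned earlier can be tried on a small example. The frame `toy` below is hypothetical; the `include="all"` argument is an optional extra that also summarises non-numeric columns.

```python
import pandas as pd

# Hypothetical toy frame with one text column and one numeric column.
toy = pd.DataFrame({
    "city": ["Chicago", "Dallas", "Chicago"],
    "searched_zipcode": [60618, 75217, 60618],
})

print(toy.dtypes)      # data type of each column

numeric_summary = toy.describe()           # numeric columns only by default
full_summary = toy.describe(include="all")  # also counts/uniques for text columns

print(numeric_summary)
print(full_summary)
```

Note that a column like `searched_zipcode` is stored as a number, so `describe()` will happily report its mean even though an average zip code is not a meaningful quantity.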

In [11]:
df_large['url'].duplicated().sum()

24945
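
This count is consistent with the earlier numbers: 38,912 rows minus 13,967 unique URLs leaves 24,945 repeated rows. The sketch below shows the same relationship on a hypothetical five-row frame; it assumes the column has no missing values, since `nunique()` ignores them by default.

```python
import pandas as pd

# Hypothetical toy frame: "a" appears three times, "b" and "c" once each.
toy = pd.DataFrame({"url": ["a", "b", "a", "c", "a"]})

n_rows = len(toy)
n_unique = toy["url"].nunique()
n_repeats = toy["url"].duplicated().sum()  # flags every occurrence after the first

print(n_rows, n_unique, n_repeats)  # 5 3 2
```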