## Wrangling Files and Data

This section covers reading files that contain unstructured data.

Import pandas as `pd`:

In [87]:
import pandas as pd

From the Berkeley Earth surface temperature dataset, download localized data for New York City:

In [1]:
! curl -o ./data/nyc_temp.txt http://berkeleyearth.lbl.gov/auto/Local/TAVG/Text/40.99N-74.56W-TAVG-Trend.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  313k  100  313k    0     0  2520k      0 --:--:-- --:--:-- --:--:-- 2527k


Inspect the file:

`! head ./data/nyc_temp.txt`

The file is not cleanly formatted. Pipe `head` into `tail` to view only lines 65 through 72:

`! head -72 ./data/nyc_temp.txt | tail -8`
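The same line-range inspection can be sketched in Python for environments without shell access. The helper name `show_lines` and the throwaway demo file are hypothetical and used only for illustration; the real dataset is not touched here.

```python
import os
import tempfile
from itertools import islice


def show_lines(path, start, stop):
    """Return lines start..stop (1-based, inclusive) from a text file."""
    with open(path, encoding="utf-8") as fh:
        # islice reads lazily, so lines past `stop` are never loaded.
        return [line.rstrip("\n") for line in islice(fh, start - 1, stop)]


# Demo on a temporary file of 100 numbered lines (not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(f"line {i}" for i in range(1, 101)))
    # Equivalent in spirit to: head -72 demo.txt | tail -8
    selected = show_lines(demo_path, 65, 72)

print(selected[0], "...", selected[-1])
```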

### Reading and Formatting Files

Create a list of column names:

`col_names = ['year', 'month', 'monthly_anom']`

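Supply the following arguments to `read_csv`:

`header=None` (do not take column names from the file)

`usecols=[0,1,2]` (keep only the first three columns)

`names=col_names` (label those columns)

`delim_whitespace=True` (split fields on runs of whitespace; deprecated since pandas 2.2 in favor of `sep=r"\s+"`)

`comment="%"` (ignore everything from a `%` character to the end of the line)

`temp = pd.read_csv('./data/nyc_temp.txt', header=None, delim_whitespace=True, usecols=[0,1,2], names=col_names, comment="%")`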
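To see how these arguments interact without downloading anything, here is a minimal sketch that parses a small in-memory sample. The sample text and its values are made up to imitate a `%`-commented, whitespace-separated layout; they are not taken from the Berkeley Earth file. The sketch uses `sep=r"\s+"`, the non-deprecated equivalent of `delim_whitespace=True`.

```python
import io

import pandas as pd

# Made-up sample: '%' comment lines followed by whitespace-separated columns.
# The numbers are illustrative only.
sample = """\
% Illustrative header comment
% Year, Month, Anomaly, Uncertainty
1750 1 -0.552 2.100
1750 2 -1.377 2.500
1750 3 NaN NaN
"""

col_names = ["year", "month", "monthly_anom"]

temp_demo = pd.read_csv(
    io.StringIO(sample),
    header=None,       # column names come from `names`, not the file
    sep=r"\s+",        # split on runs of whitespace
    usecols=[0, 1, 2], # drop the fourth (uncertainty) column
    names=col_names,
    comment="%",       # skip the commented header lines
)

print(temp_demo)
```

The literal `NaN` token is one of the default missing-value markers in `read_csv`, so the third row parses with a missing anomaly.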



### Create a DateTime Index

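`pd.to_datetime` can assemble dates from a DataFrame that has `year`, `month`, and `day` columns, so keep the date parts, add a constant `day` column, and use the result as the index:

`date_df = temp.drop('monthly_anom', axis=1)`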

`date_df["day"] = 1`

`date_index = pd.DatetimeIndex(pd.to_datetime(date_df))`

`temp = temp.set_index(date_index).drop(["year", "month"], axis=1)`
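The steps above can be tried end to end on a tiny frame. This is a sketch with made-up values; only the column names follow the ones used in this section.

```python
import pandas as pd

# Made-up monthly values; only the column names match the section above.
demo = pd.DataFrame(
    {
        "year": [1750, 1750, 1750],
        "month": [1, 2, 3],
        "monthly_anom": [-0.552, -1.377, 0.25],
    }
)

# to_datetime needs year, month, and day columns, so pin each row to day 1.
date_parts = demo[["year", "month"]].assign(day=1)
date_index = pd.DatetimeIndex(pd.to_datetime(date_parts))

# Replace the integer index with the dates and drop the now-redundant parts.
demo = demo.set_index(date_index).drop(columns=["year", "month"])

print(demo)
```

With a `DatetimeIndex` in place, label-based selection such as `demo.loc["1750-02"]` returns the rows for that month.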




### Fill Missing Values

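Forward-fill the gaps so that each missing value takes the last valid observation before it (`fillna(method="ffill")` is deprecated since pandas 2.1; `ffill()` is the equivalent):

`temp = temp.ffill()`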
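A small sketch of forward filling on a monthly series, using made-up values. It also shows the `limit` parameter, which caps how many consecutive gaps are filled.

```python
import numpy as np
import pandas as pd

# Made-up monthly series with a two-month gap.
s = pd.Series(
    [0.5, np.nan, np.nan, -0.2],
    index=pd.date_range("2000-01-01", periods=4, freq="MS"),
)

filled = s.ffill()          # every gap takes the last valid value
limited = s.ffill(limit=1)  # fill at most one consecutive gap

print(filled)
print(limited)
```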

## Working with Large Datasets

Use the `chunksize` argument of `read_csv` to parse or perform operations on large files one piece at a time:

`filename = "./data/311_Cases2019.csv"`

`c_size = 200000`

`for sf_chunk in pd.read_csv(filename, chunksize=c_size):`

&nbsp;&nbsp;&nbsp;&nbsp;`print (sf_chunk.shape)`
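To make the chunking behavior concrete, the following sketch reads a ten-row CSV held in memory, four rows at a time. The CSV content and its column names are invented for the demo and stand in for the 311 file.

```python
import io

import pandas as pd

# Invented ten-row CSV standing in for a large file.
csv_text = "CaseID,Neighborhood\n" + "\n".join(
    f"{i},N{i % 3}" for i in range(10)
)

shapes = []
total_rows = 0
# With chunksize set, read_csv returns an iterator of DataFrames.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    shapes.append(chunk.shape)
    total_rows += len(chunk)

print(shapes)
print(total_rows)
```

Only one chunk is held in memory at a time, and the last chunk is simply shorter when the row count is not a multiple of `chunksize`.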

### Chunking Multiple Operations

`for sf_chunk in pd.read_csv(filename, chunksize=c_size):`

&nbsp;&nbsp;&nbsp;&nbsp;`sf_chunk = sf_chunk.iloc[:, 0:18]`

&nbsp;&nbsp;&nbsp;&nbsp;`sf_chunk = sf_chunk.fillna("None")`

&nbsp;&nbsp;&nbsp;&nbsp;`date_index = pd.DatetimeIndex(pd.to_datetime(sf_chunk["Opened"]))`

&nbsp;&nbsp;&nbsp;&nbsp;`sf_chunk.loc[:,"Opened"] = date_index` 

&nbsp;&nbsp;&nbsp;&nbsp;`sf_chunk.loc[:,"Weekday"] = date_index.day_name()`

&nbsp;&nbsp;&nbsp;&nbsp;`print (sf_chunk["Neighborhood"])`
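Printing each chunk is useful for inspection, but a result usually has to be carried across chunks. The sketch below applies the same per-chunk steps to a small in-memory CSV and accumulates weekday counts. The rows, the timestamp format string, and the use of `collections.Counter` are assumptions made for the demo, not details taken from the 311 file.

```python
import io
from collections import Counter

import pandas as pd

# Invented rows; the "Opened" timestamp format is an assumption.
csv_text = """CaseID,Opened,Neighborhood
1,01/07/2019 08:15:00 AM,Mission
2,01/07/2019 09:30:00 AM,
3,01/08/2019 10:00:00 AM,Sunset
4,01/12/2019 11:45:00 PM,Mission
5,01/14/2019 07:05:00 AM,Sunset
"""

weekday_counts = Counter()
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Replace missing neighborhoods with a placeholder string.
    chunk["Neighborhood"] = chunk["Neighborhood"].fillna("None")
    # Parse the timestamps, then derive the weekday name for each case.
    opened = pd.to_datetime(chunk["Opened"], format="%m/%d/%Y %I:%M:%S %p")
    chunk["Opened"] = opened
    chunk["Weekday"] = opened.dt.day_name()
    # Fold this chunk's weekdays into the running total.
    weekday_counts.update(chunk["Weekday"])

print(weekday_counts)
```

Keeping only the small running total between iterations is what lets this pattern scale to files that do not fit in memory.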

### NaN Functions

In [137]:
import numpy as np

df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'], columns=['one','two', 'three'])
print ("Original" + "\n")
print (df)
df = df.reindex(['a', 'b', 'c'])
print ("\n"  + "Reindexed:" + '\n')
print (df)
print ("\n"  + "NaN replaced with '0':" + "\n")
print (df.fillna(0))

Original

        one       two     three
a  1.982738  0.486554  1.536442
c -0.854264  0.861779  0.962401
e -1.045314 -0.060082 -1.786305

Reindexed:

        one       two     three
a  1.982738  0.486554  1.536442
b       NaN       NaN       NaN
c -0.854264  0.861779  0.962401

NaN replaced with '0':

        one       two     three
a  1.982738  0.486554  1.536442
b  0.000000  0.000000  0.000000
c -0.854264  0.861779  0.962401


In [138]:
df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'], columns=['one','two', 'three'])
print ("Original" + "\n")
print (df)
df = df.reindex(['a', 'b', 'c'])
print ("\n"  + "Reindexed:" + '\n')
print (df)
df = df.dropna()
print ("\n"  + "After Dropping NaN rows:" + '\n')
print (df)

Original

        one       two     three
a  1.064932  2.136276 -0.430257
c -1.537479 -1.293200  0.211070
e -1.499677 -0.138826 -1.030252

Reindexed:

        one       two     three
a  1.064932  2.136276 -0.430257
b       NaN       NaN       NaN
c -1.537479 -1.293200  0.211070

After Dropping NaN rows:

        one       two     three
a  1.064932  2.136276 -0.430257
c -1.537479 -1.293200  0.211070
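Before filling or dropping, it often helps to count the missing values and to control which rows `dropna` removes. The following is a minimal sketch on a made-up frame; `how` and `subset` are standard `dropna` parameters.

```python
import numpy as np
import pandas as pd

# Made-up frame: row 'a' is partly missing, row 'b' is entirely missing.
df_demo = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0], "two": [np.nan, np.nan, 6.0]},
    index=["a", "b", "c"],
)

missing_per_column = df_demo.isna().sum()    # count NaN values per column
drop_any = df_demo.dropna()                  # drop rows with any NaN
drop_all = df_demo.dropna(how="all")         # drop rows that are all NaN
drop_subset = df_demo.dropna(subset=["one"]) # only consider column 'one'

print(missing_per_column)
print(drop_any)
print(drop_all)
print(drop_subset)
```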
