## Wrangling Files and Data

This section reads a file of unstructured data into pandas and cleans up the result.

Import pandas as pd:

In [21]:
import pandas as pd

From the Berkeley Earth surface temperature dataset, download localized data for New York City:

In [22]:
! curl -o ./data/nyc_temp.txt http://berkeleyearth.lbl.gov/auto/Local/TAVG/Text/40.99N-74.56W-TAVG-Trend.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  313k  100  313k    0     0  5425k      0 --:--:-- --:--:-- --:--:-- 5497k


Inspect the file:

`! head ./data/nyc_temp.txt`

This file is not well-formatted for direct parsing. Combine `head` and `tail` to narrow the selection to lines 65 through 72:

`! head -72 ./data/nyc_temp.txt | tail -8`

In [23]:
! head -72 ./data/nyc_temp.txt | tail -8


### Reading and Formatting Files

Create a list of column names:

`col_names = ['year', 'month', 'monthly_anom']`

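Supply the following arguments to `read_csv`:

`header=None` (the file has no header row to read column names from)

`usecols=[0,1,2]` (keep only the first three columns)

`names=col_names` (assign the column names defined above)

`delim_whitespace=True` (split fields on runs of whitespace)

`comment="%"` (ignore everything after a `%`, which skips the comment lines)

`temp = pd.read_csv('../data/nyc_temp.txt', header=None, delim_whitespace=True, usecols=[0,1,2], names=col_names, comment="%")`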



In [24]:
col_names = ['year', 'month', 'monthly_anom']
temp = pd.read_csv('../data/nyc_temp.txt', header=None, delim_whitespace=True, usecols=[0,1,2], names=col_names, comment="%")
temp

      year  month  monthly_anom
0     1743     11        -2.276
1     1743     12           NaN
2     1744      1           NaN
3     1744      2           NaN
4     1744      3           NaN
...    ...    ...           ...
3234  2013      5         0.574
3235  2013      6         0.982
3236  2013      7         2.172
3237  2013      8        -0.659
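
Recent pandas releases (2.2 and later) deprecate `delim_whitespace=True` in favor of `sep=r"\s+"`. The sketch below shows the equivalent call on a small in-memory sample. The sample text and the name `sample_df` are made up for illustration and only mimic the layout of the downloaded file.

```python
import io
import pandas as pd

# Hypothetical sample that mimics the file layout: '%' comment lines,
# then whitespace-separated columns (values are invented, not real data).
sample = (
    "% This is a comment line\n"
    "% Year, Month, Anomaly, Unc.\n"
    "  1950     1    -0.500  0.100\n"
    "  1950     2       NaN  0.120\n"
    "  1950     3     0.250  0.090\n"
)

col_names = ['year', 'month', 'monthly_anom']
sample_df = pd.read_csv(
    io.StringIO(sample),
    header=None,
    sep=r"\s+",          # split on runs of whitespace
    usecols=[0, 1, 2],   # keep only the first three columns
    names=col_names,
    comment="%",         # skip everything after a '%'
)
print(sample_df)
```

The fourth column in the sample is dropped by `usecols`, so three names are enough.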


### Create a DateTime Index

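`pd.to_datetime` can assemble dates from a DataFrame that has `year`, `month`, and `day` columns. The data is monthly, so add a constant `day` column, convert, and use the result as the index:

`date_df = temp.drop('monthly_anom', axis=1)`

`date_df["day"] = 1`

`date_index = pd.DatetimeIndex(pd.to_datetime(date_df))`

`temp = temp.set_index(date_index).drop(["year", "month"], axis=1)`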

In [25]:
col_names = ['year', 'month', 'monthly_anom']
temp = pd.read_csv('../data/nyc_temp.txt', header=None, usecols=[0, 1, 2], names=col_names, delim_whitespace=True, comment='%')

date_df = temp.drop('monthly_anom', axis=1)
date_df["day"] = 1
date_index = pd.DatetimeIndex(pd.to_datetime(date_df))
temp = temp.set_index(date_index).drop(["year", "month"], axis=1)
temp

            monthly_anom
1743-11-01        -2.276
1743-12-01           NaN
1744-01-01           NaN
1744-02-01           NaN
1744-03-01           NaN
...                  ...
2013-05-01         0.574
2013-06-01         0.982
2013-07-01         2.172
2013-08-01        -0.659
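
One payoff of a `DatetimeIndex` is that rows can be selected by date strings. The following is a minimal sketch on invented values (the frame `df` and series name `ts` are hypothetical, not the NYC data), building the index the same way and then selecting one year:

```python
import pandas as pd

# Invented values for illustration only.
df = pd.DataFrame({
    "year": [1950, 1950, 1951],
    "month": [11, 12, 1],
    "monthly_anom": [-0.5, 0.25, 1.0],
})

# to_datetime assembles dates from 'year', 'month', and 'day' columns.
dates = pd.to_datetime(df[["year", "month"]].assign(day=1))
ts = df.set_index(pd.DatetimeIndex(dates)).drop(["year", "month"], axis=1)

# Partial-string indexing: select every row that falls in 1950.
print(ts.loc["1950"])
```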


### Fill Missing Values

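Forward-fill the gaps so that each missing value takes the last valid observation before it. `ffill()` is equivalent to `fillna(method="ffill")`, which newer pandas releases deprecate:

`temp = temp.ffill()`

In [26]:
temp = temp.ffill()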
temp

            monthly_anom
1743-11-01        -2.276
1743-12-01        -2.276
1744-01-01        -2.276
1744-02-01        -2.276
1744-03-01        -2.276
...                  ...
2013-05-01         0.574
2013-06-01         0.982
2013-07-01         2.172
2013-08-01        -0.659
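
Forward fill is only one way to handle gaps, and for a long run of missing values it repeats a single observation many times. The sketch below, on a made-up series `s`, contrasts it with a limited forward fill and with linear interpolation. Which choice suits the temperature data is a judgment call that this example does not settle.

```python
import numpy as np
import pandas as pd

# Invented series with a two-value gap.
s = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s.ffill()          # repeat the last valid value across the gap
limited = s.ffill(limit=1)   # fill at most one consecutive NaN
linear = s.interpolate()     # draw a straight line across the gap

print(pd.DataFrame({"forward": forward, "limited": limited, "linear": linear}))
```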


## Working with Large Datasets

Use the `chunksize` argument of `read_csv` to parse or perform operations on large files in pieces. With `chunksize` set, `read_csv` returns an iterator that yields one DataFrame per chunk:

In [27]:
filename = "../data/311_Cases2019.csv"
c_size = 200000

for sf_chunk in pd.read_csv(filename, chunksize=c_size):
    print (sf_chunk.shape)

(200000, 47)
(200000, 47)
(200000, 47)
(18764, 47)
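
Because each chunk is discarded once the loop moves on, a common pattern is to reduce every chunk to a small partial result and combine the partials at the end. The following is a minimal sketch, assuming a tiny in-memory CSV in place of the large file; the `Cases` column and its values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a large file.
rows = [("Mission", 3), ("Marina", 1), ("Mission", 2), ("Parkside", 4), ("Marina", 5)]
csv_text = "Neighborhood,Cases\n" + "\n".join(f"{name},{n}" for name, n in rows)

partials = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Reduce each chunk to per-neighborhood sums before moving on.
    partials.append(chunk.groupby("Neighborhood")["Cases"].sum())

# Combine the partial sums: the same label can appear in several chunks.
totals = pd.concat(partials).groupby(level=0).sum()
print(totals)
```

Only the per-chunk sums are held in memory, not the chunks themselves.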


### Chunking Multiple Operations

In [28]:
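for sf_chunk in pd.read_csv(filename, chunksize=c_size):
    # Keep the first 18 columns and replace missing values with the string "None"
    sf_chunk = sf_chunk.iloc[:, 0:18]
    sf_chunk = sf_chunk.fillna("None")
    # Parse the "Opened" timestamps and derive the day of the week
    date_index = pd.DatetimeIndex(pd.to_datetime(sf_chunk["Opened"]))
    sf_chunk["Opened"] = date_index
    # day_name() replaces the weekday_name attribute removed in pandas 1.0
    sf_chunk["Weekday"] = date_index.day_name()
    print(sf_chunk["Neighborhood"])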

0            Hayes Valley
1            Potrero Hill
2         Diamond Heights
3                 Mission
4               Japantown
               ...       
199995       Outer Sunset
199996    South of Market
199997          Oceanview
199998      Hunters Point
199999           Nob Hill
Name: Neighborhood, Length: 200000, dtype: object
200000    Mission Dolores
200001    Dolores Heights
200002            Mission
200003            Mission
200004            Mission
               ...       
399995    South of Market
399996            Mission
399997            Mission
399998           Parkside
399999            Mission
Name: Neighborhood, Length: 200000, dtype: object
400000            Tenderloin
400001        Haight Ashbury
400002       Dolores Heights
400003                  None
400004            Tenderloin
                 ...        
599995       Dolores Heights
599996    Financial District
599997                Marina
599998          Outer Sunset
599999       South of Market
Name: Nei
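
When several steps run on every chunk, it can help to move them into a helper function and collect the cleaned chunks. The sketch below assumes a three-row in-memory CSV; the helper `clean_chunk`, the `CaseID` column, the sample rows, and the timestamp format are all hypothetical and may not match the real 311 file.

```python
import io
import pandas as pd

# Hypothetical sample rows; the second row has no neighborhood.
csv_text = (
    "CaseID,Opened,Neighborhood\n"
    "1,01/07/2019 08:15:00 AM,Mission\n"
    "2,01/08/2019 09:30:00 PM,\n"
    "3,01/12/2019 11:00:00 AM,Marina\n"
)

def clean_chunk(chunk):
    """Fill missing neighborhoods, parse timestamps, and add a weekday column."""
    chunk = chunk.copy()
    chunk["Neighborhood"] = chunk["Neighborhood"].fillna("None")
    # The format string is an assumption about how timestamps are written.
    chunk["Opened"] = pd.to_datetime(chunk["Opened"], format="%m/%d/%Y %I:%M:%S %p")
    chunk["Weekday"] = chunk["Opened"].dt.day_name()
    return chunk

chunks = [clean_chunk(c) for c in pd.read_csv(io.StringIO(csv_text), chunksize=2)]
cleaned = pd.concat(chunks)
print(cleaned[["Opened", "Weekday", "Neighborhood"]])
```

Concatenating every cleaned chunk only makes sense when the cleaned result fits in memory; otherwise each chunk could be written out or aggregated inside the loop.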

### NaN Functions

Reindexing with a label that is absent from the original index (`'b'` below) produces a row of `NaN` values, and labels left out of the new index (`'e'`) are dropped. `fillna(0)` then replaces each `NaN` with zero:

In [32]:
import numpy as np

df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'], columns=['one','two', 'three'])
print ("Original" + "\n")
print (df)
df = df.reindex(['a', 'b', 'c'])
print ("\n"  + "Reindexed:" + '\n')
print (df)
print ("\n"  + "NaN replaced with '0':" + "\n")
print (df.fillna(0))

Original

        one       two     three
a -1.199687 -0.914891 -0.732964
c -1.529484 -1.597783 -0.152724
e  0.353078  0.482614  1.049960

Reindexed:

        one       two     three
a -1.199687 -0.914891 -0.732964
b       NaN       NaN       NaN
c -1.529484 -1.597783 -0.152724

NaN replaced with '0':

        one       two     three
a -1.199687 -0.914891 -0.732964
b  0.000000  0.000000  0.000000
c -1.529484 -1.597783 -0.152724
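
`fillna` also accepts a dictionary, so each column can get its own replacement value. The following is a small sketch on an invented frame (`filled_df` and its values are hypothetical), filling one column with zero and another with its mean:

```python
import numpy as np
import pandas as pd

# Invented frame with one NaN in each column.
filled_df = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0], "two": [np.nan, 5.0, 6.0]},
    index=["a", "b", "c"],
)

# Map column name -> fill value; the mean skips NaN by default.
by_column = filled_df.fillna({"one": 0, "two": filled_df["two"].mean()})
print(by_column)
```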


Alternatively, `dropna()` removes every row that contains a `NaN`:

In [33]:
df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'], columns=['one','two', 'three'])
print ("Original" + "\n")
print (df)
df = df.reindex(['a', 'b', 'c'])
print ("\n"  + "Reindexed:" + '\n')
print (df)
df = df.dropna()
print ("\n"  + "After Dropping NaN rows:" + '\n')
print (df)

Original

        one       two     three
a -0.644861 -1.105221  0.069899
c -0.423894  1.130578  0.289239
e -1.557106 -1.898490  1.500030

Reindexed:

        one       two     three
a -0.644861 -1.105221  0.069899
b       NaN       NaN       NaN
c -0.423894  1.130578  0.289239

After Dropping NaN rows:

        one       two     three
a -0.644861 -1.105221  0.069899
c -0.423894  1.130578  0.289239
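
By default `dropna()` removes a row if any value in it is missing. The `how` and `subset` parameters narrow that rule. The sketch below uses an invented frame (`gaps` is a hypothetical name) to compare the three variants:

```python
import numpy as np
import pandas as pd

# Invented frame: row 'b' is entirely NaN, row 'c' is partly NaN.
gaps = pd.DataFrame(
    {"one": [1.0, np.nan, np.nan], "two": [2.0, np.nan, 5.0]},
    index=["a", "b", "c"],
)

any_nan = gaps.dropna()                  # drop rows with at least one NaN
all_nan = gaps.dropna(how="all")         # drop rows only if every value is NaN
by_subset = gaps.dropna(subset=["two"])  # look at column 'two' only

print(any_nan.index.tolist(), all_nan.index.tolist(), by_subset.index.tolist())
```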
