**Dealing with a large dataset on your computer** is a common challenge for data analysts. If you need to process or filter the data, you can set the `chunksize=` argument of the pandas `read_csv()` function. Instead of one dataframe, it then returns an iterator that you can loop through to work with manageable chunks of the data.

For example, suppose you want to pull just the rows where the `HCPS Code` is **99213** from a large data file (`HCPS_data.csv`). You could read the file 100,000 rows at a time, filter each chunk to the rows with that code, save the filtered results of each chunk to a list, and concatenate them all at the end. The syntax would look something like this:

```
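import pandas as pd

code_99213_rows = []

# Read the codes as text so the comparison with '99213' works in every chunk
for chunk in pd.read_csv('HCPS_data.csv', chunksize=100000, dtype={'HCPS Code': str}):
    # Keep only the rows with the code of interest
    code_99213_rows.append(chunk[chunk['HCPS Code'] == '99213'])

# Combine the filtered chunks into a single dataframe
code_99213_df = pd.concat(code_99213_rows, ignore_index=True)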
```
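
To try the pattern without a large file, here is a minimal self-contained sketch. The in-memory CSV, its values, and the chunk size of 2 are made up for illustration. Passing `dtype={"HCPS Code": str}` is an assumption that keeps the codes as text, so the comparison with `"99213"` behaves the same way in every chunk.

```python
import io

import pandas as pd

# Hypothetical stand-in for a large file: a tiny CSV held in memory.
csv_text = """HCPS Code,Charge
99213,75.0
G0008,20.0
99213,80.0
99214,110.0
99213,78.5
"""

matches = []
# Read two rows at a time; each chunk is an ordinary DataFrame.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2, dtype={"HCPS Code": str}):
    matches.append(chunk[chunk["HCPS Code"] == "99213"])

# Stitch the filtered pieces back together with a fresh 0..n-1 index.
demo_df = pd.concat(matches, ignore_index=True)
print(demo_df)
```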
======================================================================   

To shrink a file so that it loads more quickly, it can make sense to convert a text file (CSV) to a binary format. In Python, you can first reduce the data's memory footprint and then store the resulting object (a dataframe) as a [pickle](https://docs.python.org/3/library/pickle.html) file.

In [1]:
import pandas as pd
import pickle

In [2]:
%%time
may = pd.read_csv('../data/may.csv')
may.head()

Wall time: 29.3 s


| | pubdatetime | latitude | longitude | sumdid | sumdtype | chargelevel | sumdgroup | costpermin | companyname |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-05-01 00:01:41.247000 | 36.136822 | -86.799877 | PoweredLIRL1 | Powered | 93.0 | scooter | 0.0 | Bird |
| 1 | 2019-05-01 00:01:41.247000 | 36.191252 | -86.772945 | PoweredXWRWC | Powered | 35.0 | scooter | 0.0 | Bird |
| 2 | 2019-05-01 00:01:41.247000 | 36.144752 | -86.806293 | PoweredMEJEH | Powered | 90.0 | scooter | 0.0 | Bird |
| 3 | 2019-05-01 00:01:41.247000 | 36.162056 | -86.774688 | Powered1A7TC | Powered | 88.0 | scooter | 0.0 | Bird |
| 4 | 2019-05-01 00:01:41.247000 | 36.150973 | -86.783109 | Powered2TYEF | Powered | 98.0 | scooter | 0.0 | Bird |


### Now try to reduce the size of the file
- columns with the `object` dtype (here, strings) require the most space


In [3]:
may.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20292503 entries, 0 to 20292502
Data columns (total 9 columns):
 #   Column       Dtype  
---  ------       -----  
 0   pubdatetime  object 
 1   latitude     float64
 2   longitude    float64
 3   sumdid       object 
 4   sumdtype     object 
 5   chargelevel  float64
 6   sumdgroup    object 
 7   costpermin   float64
 8   companyname  object 
dtypes: float64(4), object(5)
memory usage: 1.4+ GB
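
The `+` in `memory usage: 1.4+ GB` means pandas reported a lower bound: the strings held in `object` columns were not counted. As a minimal sketch on a small made-up frame (the values are illustrative, not taken from `may.csv`), `memory_usage(deep=True)` measures each column including its strings:

```python
import pandas as pd

# Small illustrative frame; dtype=object mirrors what .info() shows above.
sample = pd.DataFrame({
    "latitude": [36.136822, 36.191252, 36.144752],
    "sumdgroup": pd.Series(["scooter", "scooter", "bicycle"], dtype=object),
    "companyname": pd.Series(["Bird", "Lyft", "Bird"], dtype=object),
})

# deep=True inspects the Python string objects instead of only the pointers.
shallow = sample.memory_usage(deep=False)
deep = sample.memory_usage(deep=True)
print(pd.DataFrame({"shallow_bytes": shallow, "deep_bytes": deep}))
```

The float column reports the same size either way, while the `object` columns grow once their strings are counted.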


#### Convert the company name to an integer
- find the unique company names
- assign each company an integer (you can use a dictionary for this step)
- update the `companyname` column to store the integer id for each company

In [4]:
may.companyname.unique()

array(['Bird', 'Lyft', 'Gotcha', 'Lime', 'Spin', 'Jump', 'Bolt'],
      dtype=object)

In [5]:
company_dict = {'Bird':0, 'Lyft': 1, 'Gotcha': 2, 'Lime': 3, 'Spin': 4, 'Jump': 5, 'Bolt': 6}

In [6]:
# map looks each name up in the dictionary and returns its integer id
may['companyname'] = may['companyname'].map(company_dict)
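
A hand-written dictionary works well for seven companies. As an alternative sketch (not the approach used in this notebook), the pandas `category` dtype builds the same kind of integer lookup automatically and keeps the original labels available. The sample values below are made up for illustration.

```python
import pandas as pd

# Illustrative sample; dtype=object mirrors the column as read from CSV.
companies = pd.Series(["Bird", "Lyft", "Bird", "Lime", "Lyft", "Bird"], dtype=object)

# Each distinct label is stored once; rows hold small integer codes.
as_category = companies.astype("category")

print(as_category.cat.categories.tolist())  # the distinct labels, sorted
print(as_category.cat.codes.tolist())       # one integer code per row
print(as_category.cat.codes.dtype)          # int8 when there are few categories
```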

#### Next, convert `pubdatetime` to a datetime

In [7]:
may.pubdatetime = pd.to_datetime(may.pubdatetime)
may.head(2)

| | pubdatetime | latitude | longitude | sumdid | sumdtype | chargelevel | sumdgroup | costpermin | companyname |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-05-01 00:01:41.247 | 36.136822 | -86.799877 | PoweredLIRL1 | Powered | 93.0 | scooter | 0.0 | 0 |
| 1 | 2019-05-01 00:01:41.247 | 36.191252 | -86.772945 | PoweredXWRWC | Powered | 35.0 | scooter | 0.0 | 0 |
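
To see why this conversion helps, here is a minimal sketch on made-up timestamps in the same layout as `pubdatetime`. Passing `format=` is optional; it is included on the assumption that every value follows this exact pattern, which spares pandas from inferring the format.

```python
import pandas as pd

# Illustrative strings in the same layout as the pubdatetime column.
raw = pd.Series(
    ["2019-05-01 00:01:41.247000", "2019-05-01 00:06:41.530000"] * 1000,
    dtype=object,
)

# An explicit format avoids guessing; %f matches the fractional seconds.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S.%f")

print(parsed.dtype)
print("as strings  :", raw.memory_usage(index=False, deep=True), "bytes")
print("as datetimes:", parsed.memory_usage(index=False, deep=True), "bytes")
```

Each datetime value occupies a fixed 8 bytes, whereas each string carries the overhead of a full Python object.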


#### Next, remove unneeded data
#### Keep just the scooters

In [8]:
may.sumdgroup.unique()

array(['scooter', 'Scooter', 'bicycle'], dtype=object)

In [9]:
may_scooters = may.loc[may.sumdgroup.isin(['scooter', 'Scooter'])]
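
Listing both spellings works here because `unique()` showed exactly two of them. As a hedged alternative, normalizing case first would also catch variants such as `SCOOTER` that might appear in another file. The sample rows below are made up for illustration.

```python
import pandas as pd

# Illustrative rows with inconsistent capitalization.
sample = pd.DataFrame({
    "sumdid": ["PoweredA", "PoweredB", "PoweredC", "PoweredD"],
    "sumdgroup": ["scooter", "Scooter", "bicycle", "SCOOTER"],
})

# Lowercase the labels before comparing so every spelling matches.
is_scooter = sample["sumdgroup"].str.lower() == "scooter"
scooters_only = sample.loc[is_scooter]
print(scooters_only)
```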

#### Keep just the columns we want to work with

In [10]:
may_scooters = may_scooters[['pubdatetime', 'latitude', 'longitude', 'sumdid', 'chargelevel', 'companyname']]

#### Check `.info()` again

In [11]:
may_scooters.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20283582 entries, 0 to 20292502
Data columns (total 6 columns):
 #   Column       Dtype         
---  ------       -----         
 0   pubdatetime  datetime64[ns]
 1   latitude     float64       
 2   longitude    float64       
 3   sumdid       object        
 4   chargelevel  float64       
 5   companyname  int64         
dtypes: datetime64[ns](1), float64(3), int64(1), object(1)
memory usage: 1.1+ GB
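
The numeric columns could shrink further. `companyname` holds only the values 0 to 6, and `chargelevel` appears to hold percentages, so smaller numeric types may suffice. This is a sketch on made-up rows, not a step from the notebook. `latitude` and `longitude` are left as `float64` because `float32` keeps only about seven significant digits, which would blur the coordinates.

```python
import pandas as pd

# Illustrative rows shaped like may_scooters.
sample = pd.DataFrame({
    "latitude": [36.136822, 36.191252, 36.144752],
    "chargelevel": [93.0, 35.0, 90.0],
    "companyname": [0, 1, 6],
})
before = sample.memory_usage(index=False).sum()

# downcast picks the smallest dtype that can hold every value present.
sample["companyname"] = pd.to_numeric(sample["companyname"], downcast="integer")
sample["chargelevel"] = pd.to_numeric(sample["chargelevel"], downcast="float")

after = sample.memory_usage(index=False).sum()
print(sample.dtypes)
print(before, "bytes ->", after, "bytes")
```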


#### The only `object` column remaining is `sumdid` (an alphanumeric unique identifier)
- time to pickle

In [12]:
may_scooters.to_pickle("../data/may.pkl")

In [None]:
%%time
may_test = pd.read_pickle("../data/may.pkl")
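
`to_pickle` writes into an existing folder but does not create one, so a relative path only works if that folder exists relative to the notebook's working directory. Here is a minimal sketch, using a temporary directory and a made-up frame in place of the real data, that creates the folder first and then confirms that the round trip preserves the dtypes:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Illustrative frame with the kinds of dtypes prepared above.
frame = pd.DataFrame({
    "pubdatetime": pd.to_datetime(["2019-05-01 00:01:41.247", "2019-05-01 00:06:41.530"]),
    "chargelevel": [93.0, 35.0],
    "companyname": [0, 1],
})

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical output folder; mkdir creates it (and any parents) if missing.
    out_dir = Path(tmp) / "data"
    out_dir.mkdir(parents=True, exist_ok=True)

    pkl_path = out_dir / "may.pkl"
    frame.to_pickle(pkl_path)
    restored = pd.read_pickle(pkl_path)

# Unlike a CSV, the pickle keeps the datetime and integer dtypes intact.
print(restored.dtypes)
print(restored.equals(frame))
```

Only load pickle files that you created or trust, since unpickling can execute arbitrary code.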