# Importing data from a number of zipfiles containing csv files

This notebook extracts data from zipped csv files spread over multiple zipfiles (which may contain folders). The code accesses the csv files inside each zipfile and merges them all into one pandas DataFrame.

Run the first three cells. Then specify your path and the number of zipfiles you want to convert to a DataFrame.
The unzipping is done sequentially: specifying 'number_to_unpack = 3' will unpack the first three zipfiles and store their contents in a DataFrame.
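
As a minimal sketch of what "the first three zipfiles" can mean in practice: `os.listdir` returns names in arbitrary order, so the example below sorts the names before slicing. The helper name `first_n_zips` and the file names are made up for illustration and are not part of this notebook.

```python
import os
import tempfile
import zipfile


def first_n_zips(folder, n=-1):
    """Return the first n zip file names in folder, sorted by name (-1 means all)."""
    names = sorted(f for f in os.listdir(folder) if f.lower().endswith(".zip"))
    return names if n == -1 else names[:n]


# Build a throwaway folder holding three empty zip archives and one
# unrelated text file (all names are hypothetical).
demo_dir = tempfile.mkdtemp()
for name in ["data_03.zip", "data_01.zip", "data_02.zip", "notes.txt"]:
    path = os.path.join(demo_dir, name)
    if name.endswith(".zip"):
        zipfile.ZipFile(path, "w").close()
    else:
        open(path, "w").close()

selected = first_n_zips(demo_dir, 2)
print(selected)
```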

This is typically useful for data stored in zipfiles that needs to be accessed often.

Created: 16/10/2018, L.A. Damen

In [None]:
import pandas as pd
import zipfile
from os import listdir
import timeit
import pickle

In [None]:
def zip_to_df(this_zip):
    '''
    Converts all csv files in a single zipfile (this_zip) to one pandas DataFrame.
    Relies on the global variable my_path, which must be specified explicitly
    in the main script (cell 4) before this function is called.
    
    Input: name of a single zipfile containing csv files
    Output: DataFrame of all csv files in the zipfile
    
    '''
    print("Starting conversion for " + str(this_zip))
    # the context manager closes the zipfile once all csv files have been read
    with zipfile.ZipFile(my_path + this_zip) as zf:
        # names of the csv files within that zipfile (skips folders and other file types)
        all_csv_names = [name for name in zf.namelist() if name.lower().endswith('.csv')]
        df_this_zip = pd.concat([pd.read_csv(zf.open(csv_name), parse_dates = True) for csv_name in all_csv_names])
    print("Finished.")
    
    return df_this_zip
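
The function above reads its folder from the global `my_path`. As a hedged alternative, the sketch below takes the archive explicitly and records which csv file each row came from, using the `keys` argument of `pd.concat`. The name `zip_to_labeled_df`, the column name `source_file`, and the in-memory demo archive are assumptions made for illustration only.

```python
import io
import zipfile

import pandas as pd


def zip_to_labeled_df(zip_source):
    """Read every csv in a zip archive into one DataFrame with a source_file column.

    zip_source may be a path or a file-like object (zipfile.ZipFile accepts both).
    """
    with zipfile.ZipFile(zip_source) as zf:
        csv_names = [n for n in zf.namelist() if n.lower().endswith(".csv")]
        frames = []
        for csv_name in csv_names:
            with zf.open(csv_name) as handle:
                frames.append(pd.read_csv(handle))
    # keys= adds an outer index level holding the csv name; move it into a column
    combined = pd.concat(frames, keys=csv_names, names=["source_file", "row"])
    return combined.reset_index(level="source_file").reset_index(drop=True)


# Demo on a small in-memory archive with made-up contents.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("a.csv", "x,y\n1,2\n3,4\n")
    zf.writestr("sub/b.csv", "x,y\n5,6\n")
    zf.writestr("readme.txt", "not a csv")
buffer.seek(0)

labeled = zip_to_labeled_df(buffer)
print(labeled)
```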

In [None]:
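def get_df_from_directory(my_path, num_zips = -1):
    '''
    Using zip_to_df, converts all csv files from the zipfiles to one pandas DataFrame.
    Note: zip_to_df reads the global my_path, so that global must hold the same path as the one passed in here.
    
    Input:
    1. my_path: path where all the zipfiles are located (including the trailing slash)
    2. num_zips: number of zipped files you want to create a DataFrame of.
        The default for this parameter (-1) is ALL files
        
    Output: DataFrame of all csv files in all zipfiles
    '''
    # listdir returns names in arbitrary order, so sort them and keep only the zipfiles
    all_zipnames = sorted(name for name in listdir(my_path) if name.lower().endswith('.zip'))
    
    if num_zips == -1:
        num_zips = len(all_zipnames)
        
    final_df = pd.concat(list(map(zip_to_df, all_zipnames[0:num_zips])))
    
    return final_df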

In [None]:
## Calling the Main Function

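my_path = 'C:/SOMETHING_HERE/'  # folder containing the zipfiles; keep the trailing slash
number_to_unpack = ...  # fill in the number of zipfiles to unpack (-1 unpacks all)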

df_final = get_df_from_directory(my_path, number_to_unpack)

df_final.head()
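
One thing to watch for after concatenation: `pd.concat` keeps each input's own row index by default, so index labels repeat in the combined frame. The sketch below shows this with two small made-up frames, and how `ignore_index=True` gives one continuous index instead. Whether renumbering is wanted here depends on your data.

```python
import pandas as pd

# Two small made-up frames standing in for the contents of two csv files.
part_1 = pd.DataFrame({"value": [10, 20]})
part_2 = pd.DataFrame({"value": [30, 40]})

kept = pd.concat([part_1, part_2])
renumbered = pd.concat([part_1, part_2], ignore_index=True)

print(kept.index.tolist())        # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```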

In [None]:
## Saving the dataframe created

## Use pickle for fast loading in Python, or csv as an alternative

df_final.to_csv("all_data.csv")
df_final.to_pickle("all_data.pkl")
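
The two formats behave differently on reload: a pickle restores the dtypes as they were, while a csv stores plain text, so date columns come back as strings unless `parse_dates` is given. A minimal sketch with a made-up frame and temporary file names (not the `all_data` files above):

```python
import os
import tempfile

import pandas as pd

# Made-up frame with one datetime column and one float column.
frame = pd.DataFrame({
    "timestamp": pd.to_datetime(["2018-10-16", "2018-10-17"]),
    "value": [1.5, 2.5],
})

tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "demo.csv")
pkl_path = os.path.join(tmp_dir, "demo.pkl")

# index=False avoids writing the row index as an extra unnamed column
frame.to_csv(csv_path, index=False)
frame.to_pickle(pkl_path)

from_pickle = pd.read_pickle(pkl_path)
from_csv_plain = pd.read_csv(csv_path)
from_csv_dates = pd.read_csv(csv_path, parse_dates=["timestamp"])

print(from_pickle.dtypes)
print(from_csv_plain.dtypes)
print(from_csv_dates.dtypes)
```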