### Exercise 1. Loading JSON files with pandas. Reading data is the first step in any data science project. You will often work with data in JSON format and run into problems at the very beginning. The Python libraries json and pandas do most of the work for you. In this exercise, you solve two common issues that arise when loading a JSON file into a data frame. Follow the instructions:

***c. Load the 01-call_logs.json file to the 00-load notebook.***

- Explore the JSON file to better infer the loading strategy
> By visual inspection, the JSON file has a dictionary-like structure, with each entry consisting of a single element.

- Use pandas read_json() method

In [1]:
import pandas as pd
call_logs_01 = pd.read_json('../data/raw/01-call_logs.json')

In [2]:
print(call_logs_01)

                start_date  abandon  prequeue  inqueue  agent_time  postqueue  \
0      2019-01-01 00:00:00        1         5      153           0          0   
1      2019-01-28 16:43:00        1       233        0           0          0   
2      2019-01-31 11:36:00        0        12        0        1018          1   
3      2019-01-17 13:23:00        0         6        4         114          0   
4      2019-01-22 13:58:00        0         5       51         141          0   
...                    ...      ...       ...      ...         ...        ...   
31594  2019-01-07 09:12:00        0         6      105         275          0   
31595  2019-01-29 09:37:00        0        13        0         314          0   
31596  2019-01-22 11:39:00        0         6       65         178          0   
31597  2019-01-15 12:06:00        0        10        0           2          0   
31598  2019-01-17 17:46:00        1         9       15           0          0   

       total_time  sla    c

- Test different parameters in the read_json() function

> We may try different options of `read_json()`. For example, `lines=True` tells pandas to parse the file line by line, treating each line as a separate JSON record:
>
> `pd.read_json('../data/raw/01-call_logs.json', lines=True)`
>
> while `typ='series'` returns a Series instead of a dataframe:
>
> `pd.read_json('../data/raw/01-call_logs.json', typ='series')`
>
> We can also use the method [`to_json()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_json.html#pandas.DataFrame.to_json) to convert the dataframe into a JSON string. With `orient='split'`, the string has a `dict`-like layout with separate entries for the columns, the index and the values:
>
> `pd.read_json('../data/raw/01-call_logs.json').to_json(orient='split')`
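> The options above can be tried on a small in-memory example. This is a minimal sketch: the two records below are made up for illustration (only the column names are borrowed from the call logs), and `io.StringIO` stands in for the file on disk.

```python
import io
import json

import pandas as pd

# Hypothetical miniature of a flat call-log file (not the real data)
records = [
    {"abandon": 1, "prequeue": 5, "inqueue": 153},
    {"abandon": 0, "prequeue": 233, "inqueue": 0},
]

as_array = json.dumps(records)                        # one JSON array
as_lines = "\n".join(json.dumps(r) for r in records)  # one JSON object per line

# Default parsing expects a single JSON document
df_default = pd.read_json(io.StringIO(as_array))

# lines=True parses each line as its own record
df_lines = pd.read_json(io.StringIO(as_lines), lines=True)

# orient='split' separates column labels, index labels and values
split = json.loads(df_default.to_json(orient="split"))
print(sorted(split))
```

> Both calls should yield the same two-row dataframe here; which one fits a real file depends on whether the file stores one array or one object per line.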

***d. Load the 02-call_logs.json file to the 00-load notebook.***

- Explore the JSON file to better infer the loading strategy
> By visual inspection, the JSON file has a nested dictionary-like structure with element 'call_logs'.

- Use pandas read_json() method

In [3]:
call_logs_02 = pd.read_json('../data/raw/02-call_logs.json')

In [4]:
print(call_logs_02)

        id_cont  cont_name cont_last_name                  cont_email  \
0      10010027  Victorino         Tudela  domingocompany@example.net   
1      10010027  Victorino         Tudela  domingocompany@example.net   
2      10010027  Victorino         Tudela  domingocompany@example.net   
3      10010027  Victorino         Tudela  domingocompany@example.net   
4      10010027  Victorino         Tudela  domingocompany@example.net   
...         ...        ...            ...                         ...   
33339  Z999265T    Stewart          Davis    williamsiain@example.com   
33340  Z999265T    Stewart          Davis    williamsiain@example.com   
33341  Z999265T    Stewart          Davis    williamsiain@example.com   
33342  Z999608T     Connor          Baker      morganglen@example.net   
33343  Z999608T     Connor          Baker      morganglen@example.net   

          cont_phone  id_agn agn_name  last_name  \
0       750144000000   12750   Elston     Howard   
1       75014400000

- Verify the nested list call_log inside the pandas dataframe
> As noted in the exploration step above, the file is nested. The nested list also shows up when working with `call_logs_02`: its last column holds a list of dictionaries for each row.

- Explore a method to flatten the json file when loading to pandas dataframe
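> To flatten a dataframe, we may first convert it to a NumPy array using [`to_numpy()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_numpy.html) and then flatten it using [`flatten()`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.flatten.html). Note that this collapses the table of values into a one-dimensional array; the nested list in the last column remains a list of dictionaries.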

In [5]:
print(call_logs_02.to_numpy().flatten())

['10010027' 'Victorino' 'Tudela' ... 3 'Reflective'
 list([{'contact_id': '0314374986bc', 'master_id': '2214374986bc', 'start_date': '2/1/2019 20:48', 'abandon': 0, 'prequeue': 10, 'inqueue': 219, 'agent_time': 178, 'postqueue': 0, 'total_time': 407, 'sla': 1.0, 'abandon_time': 0, 'date': '2/1/2019', 'start_time': '20:48:00'}])]


> Alternatively, we may pass `lines=True`, as described in point (c). In this case, each record is kept as a dictionary with its keys visible, and the records are laid out one per column.

In [6]:
print(pd.read_json('../data/raw/02-call_logs.json', lines=True))

                                               0      \
0  {'id_cont': '10010027', 'cont_name': 'Victorin...   

                                               1      \
0  {'id_cont': '10010027', 'cont_name': 'Victorin...   

                                               2      \
0  {'id_cont': '10010027', 'cont_name': 'Victorin...   

                                               3      \
0  {'id_cont': '10010027', 'cont_name': 'Victorin...   

                                               4      \
0  {'id_cont': '10010027', 'cont_name': 'Victorin...   

                                               5      \
0  {'id_cont': '10010027', 'cont_name': 'Victorin...   

                                               6      \
0  {'id_cont': '1005222', 'cont_name': 'Juan Anto...   

                                               7      \
0  {'id_cont': '1005222', 'cont_name': 'Juan Anto...   

                                               8      \
0  {'id_cont': '1005222', 'cont_name': '
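> Another way to flatten nested JSON is [`pd.json_normalize()`](https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html), which expands a nested list into one row per entry. The sketch below is not the approach used in this notebook. It runs on made-up records, and the key name `call_logs` and the choice of `meta` fields are assumptions about the file layout.

```python
import pandas as pd

# Hypothetical miniature of the nested file; key names are assumptions
records = [
    {
        "id_cont": "10010027",
        "cont_name": "Victorino",
        "call_logs": [
            {"contact_id": "a1", "total_time": 407},
            {"contact_id": "a2", "total_time": 120},
        ],
    },
    {
        "id_cont": "Z999608T",
        "cont_name": "Connor",
        "call_logs": [{"contact_id": "b1", "total_time": 55}],
    },
]

# record_path names the nested list to expand into rows;
# meta lists the parent fields to repeat on every expanded row
flat = pd.json_normalize(
    records,
    record_path="call_logs",
    meta=["id_cont", "cont_name"],
)
print(flat)
```

> Each call becomes its own row, and the contact fields are repeated alongside it, so the result stays a regular dataframe.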

***e. Write a method wrapping the techniques of c. and d. above to load json files. For example, a method load_json() with parameters to load directly or flatten and load json files***

In [7]:
def load_json(file, flat=False):
    """A method to load a JSON file with pandas
        file: name of file with path included
        flat: if True, it returns the values as a flattened NumPy array. Default is False"""

    df = pd.read_json(file)

    # to_numpy().flatten() collapses the table into a one-dimensional array
    if flat:
        return df.to_numpy().flatten()
    return df

In [8]:
print(load_json('../data/raw/02-call_logs.json', flat=True))

['10010027' 'Victorino' 'Tudela' ... 3 'Reflective'
 list([{'contact_id': '0314374986bc', 'master_id': '2214374986bc', 'start_date': '2/1/2019 20:48', 'abandon': 0, 'prequeue': 10, 'inqueue': 219, 'agent_time': 178, 'postqueue': 0, 'total_time': 407, 'sla': 1.0, 'abandon_time': 0, 'date': '2/1/2019', 'start_time': '20:48:00'}])]
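> As a possible variant of the wrapper, the flattening step could expand the nested list into rows instead of producing a one-dimensional array. This is only a sketch: the function name `load_json_records` and its parameters `record_path` and `meta` are hypothetical and are not part of the exercise.

```python
import json

import pandas as pd


def load_json_records(file, record_path=None, meta=None):
    """Load a JSON file; optionally expand one nested list into rows.

    file: name of file with path included
    record_path: key of the nested list to expand (None loads the file as is)
    meta: parent fields to repeat on every expanded row
    """
    if record_path is None:
        # Direct load, same as pd.read_json(file)
        return pd.read_json(file)

    # json_normalize works on parsed Python objects, so read the file first
    with open(file, encoding="utf-8") as handle:
        data = json.load(handle)
    return pd.json_normalize(data, record_path=record_path, meta=meta)
```

> Under the assumption that the nested key is called `call_logs`, a call such as `load_json_records(path, record_path='call_logs', meta=['id_cont'])` would return one row per call.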


### Exercise 2. Loading data and saving to parquet. Other common formats to store data are txt, csv, xlsx, and parquet. One challenge in Big Data is the variety of data, e.g., different formats. In this exercise, you create a method to read txt, csv, xlsx, and parquet files into a dataframe. Follow the steps:

***c. For each of the files with prefix 03-call_logs to 06-call_logs load the file to the 00-load notebook. Please observe the following:***

- Explore the methods in pandas to load files based on their formats
> To load parquet files, we may use [`read_parquet`](https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html), which needs a parquet engine: either `pyarrow` or `fastparquet`. Here we install both.

In [9]:
pip install pyarrow

Note: you may need to restart the kernel to use updated packages.


In [10]:
pip install fastparquet

Note: you may need to restart the kernel to use updated packages.


In [11]:
call_logs_03 = pd.read_parquet('../data/raw/03-call_logs.parquet')

> To load xlsx files, we may use [`read_excel`](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html), for which we first need to install the package `openpyxl`:

In [12]:
pip install openpyxl

Note: you may need to restart the kernel to use updated packages.


In [13]:
call_logs_04 = pd.read_excel('../data/raw/04-call_logs.xlsx', sheet_name='call_logs')

> To load csv and txt files, we may use [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). By visual inspection, the txt file is comma-separated. A comma is already the default separator of `read_csv`, so the optional argument `delimiter=','` only makes the choice explicit.

In [14]:
call_logs_05 = pd.read_csv('../data/raw/05-call_logs.txt', delimiter=',')

In [15]:
call_logs_06 = pd.read_csv('../data/raw/06-call_logs.csv')
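> The behaviour of `read_csv` can be checked on in-memory text. This is a minimal sketch with made-up rows. It also illustrates one common origin of columns named `Unnamed: 0`: a file that was saved together with its index. Whether that applies to the call logs files is an assumption.

```python
import io

import pandas as pd

# Hypothetical comma-separated content (not the real call logs)
csv_text = "abandon,prequeue\n1,5\n0,233\n"

# delimiter=',' and the default separator give the same result
explicit = pd.read_csv(io.StringIO(csv_text), delimiter=",")
default = pd.read_csv(io.StringIO(csv_text))
same = explicit.equals(default)

# An empty first header cell (a saved index) is named 'Unnamed: 0' by pandas
with_index = pd.read_csv(io.StringIO(",abandon,prequeue\n0,1,5\n1,0,233\n"))

# Keep only the columns whose name does not start with 'Unnamed'
cleaned = with_index.loc[:, ~with_index.columns.str.startswith("Unnamed")]
print(list(with_index.columns), list(cleaned.columns))
```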

***d. Write a method to automatically load the call logs files. Do not forget to include the json file loader from exercise 1 to load the json files.***

In [16]:
def load_call_logs(file, flat=False, sheet='call_logs'):
    """A method to load call_logs files of extension txt, csv, xlsx, parquet or json
        file: name of file with path included
        flat: if True, it returns the values as a flattened NumPy array. Default is False
        sheet: for xlsx files, it specifies the sheet name. Default is 'call_logs'"""

    if file.endswith('.txt'):
        df = pd.read_csv(file, delimiter=',')
    elif file.endswith('.csv'):
        df = pd.read_csv(file)
    elif file.endswith('.xlsx'):
        df = pd.read_excel(file, sheet_name=sheet)
    elif file.endswith('.parquet'):
        df = pd.read_parquet(file)
    elif file.endswith('.json'):
        df = pd.read_json(file)
    else:
        # Fail loudly instead of returning a string the caller could mistake for data
        raise ValueError(f'Not a valid extension: {file}')

    # Remove unnamed columns; astype(str) keeps this safe when column labels are not strings
    df = df.loc[:, ~df.columns.astype(str).str.startswith('Unnamed')]

    # Flatten if flat=True
    if flat:
        return df.to_numpy().flatten()
    return df
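
> The chain of `endswith` checks could also be written as a lookup table from file extension to reader. The sketch below is an alternative design, not the loader defined above: the names `READERS` and `load_by_suffix` are hypothetical, and extra options such as `sheet_name` are simply forwarded as keyword arguments.

```python
from pathlib import Path

import pandas as pd

# Map each supported extension to the pandas reader that handles it
READERS = {
    ".txt": pd.read_csv,
    ".csv": pd.read_csv,
    ".xlsx": pd.read_excel,
    ".parquet": pd.read_parquet,
    ".json": pd.read_json,
}


def load_by_suffix(file, **kwargs):
    """Pick a reader from the file extension and forward any extra options."""
    suffix = Path(file).suffix.lower()
    try:
        reader = READERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported extension: {suffix!r}") from None
    return reader(file, **kwargs)
```

> Supporting a new format then only requires a new entry in the table, for example `load_by_suffix('../data/raw/04-call_logs.xlsx', sheet_name='call_logs')` for the Excel file.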