In [1]:
import pandas as pd

## 1. ! (system shell access)

In a Jupyter notebook, any statement that starts with an exclamation mark (`!`) is sent to the shell of the underlying operating system. The commands below (`cd`, `dir`) are Windows commands.

In [2]:
!cd

C:\Users\jnpicao\Documents\GitHub\batch3-workspace\S02 - Data Wrangling


In [3]:
!dir

 Volume in drive C is Windows
 Volume Serial Number is 261F-6607

 Directory of C:\Users\jnpicao\Documents\GitHub\batch3-workspace\S02 - Data Wrangling

05/08/2019  12:35    <DIR>          .
05/08/2019  12:35    <DIR>          ..
01/08/2019  15:51    <DIR>          .ipynb_checkpoints
05/08/2019  11:51    <DIR>          BLU01 - Messy Data
05/08/2019  12:35            21 366 S02 - Data Wrangling - resumo.ipynb
               1 File(s)         21 366 bytes
               4 Dir(s)  114 771 439 616 bytes free


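**Counting the number of lines in a file** (here with the Windows commands `findstr` and `find`; the output of a `!` command can be assigned to a Python variable)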

In [4]:
count_total = ! findstr /R /N "^" ".\BLU01 - Messy Data\data\exercises\elon_musk.txt" | find /C ":"
count_total = int(count_total[0])
count_total

18
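The `findstr`/`find` pipeline above only works in a Windows shell. As a portable alternative, the lines can be counted in plain Python by iterating over the file object. This is a minimal sketch: the helper name `count_lines` and the temporary demo file are hypothetical and only stand in for the notebook's data file.

```python
import os
import tempfile


def count_lines(path, encoding="utf-8"):
    """Count lines by iterating over the file, without loading it all into memory."""
    with open(path, "r", encoding=encoding) as f:
        return sum(1 for _ in f)


# Demo on a small temporary file (a stand-in for the real data file)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("first line\nsecond line\nthird line\n")
    demo_path = tmp.name

n_lines = count_lines(demo_path)
os.remove(demo_path)
print(n_lines)
```

Because the count is an ordinary `int`, no conversion of shell output is needed.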

## 2. File Formats Introduction  

**Pandas has a set of functions to read data from files into pandas DataFrames**.
Their signature looks like

```
pd.read_XXX(filepath, other arguments)
```

where XXX should be replaced with the file type: **csv** (for delimiter-separated values files), **json**, **excel**, **html**, among others.
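To illustrate the shared signature, here is a small sketch that reads the same made-up records from CSV text and from JSON text. The data and variable names are invented for the example, and `io.StringIO` is used only to avoid needing files on disk.

```python
import io

import pandas as pd

# The same two records, once as CSV text and once as JSON text (made-up data)
csv_text = "name,hp\nBulbasaur,45\nCharmander,39\n"
json_text = '[{"name": "Bulbasaur", "hp": 45}, {"name": "Charmander", "hp": 39}]'

# Both readers take a path or a file-like object as first argument
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)
```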

In [5]:
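# raw string (r'...') so that the backslashes in the Windows path are not treated as escape characters
file_path = r'.\BLU01 - Messy Data\data\exercises\portugal_urban_waste_per_inhabitant.csv'
try:
    df3 = pd.read_csv(file_path)
except pd.errors.ParserError:
    print('Ooops!!! We got an error!')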

Ooops!!! We got an error!


What if pandas' read functions fail because of errors in the file, and we need to inspect its contents?

## 3. Inspecting data and dealing with bad lines

### 3.1 Ignoring bad lines  
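We can instruct pandas' read functions to skip the bad lines instead of raising an error, by passing `error_bad_lines=False`. (This parameter was deprecated in pandas 1.3 and removed in pandas 2.0, where `on_bad_lines='skip'` plays the same role.)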

In [6]:
df3 = pd.read_csv(file_path, error_bad_lines=False, index_col=0, na_values=['no-data', '999999'])
df3[0:9]

b'Skipping line 3: expected 3 fields, saw 6\nSkipping line 30: expected 3 fields, saw 5\n'


Years,Urban waste collection per inhabitant (kg/inhabitant),Selective urban waste collection per inhabitant (kg/inhabitant)
1989,,
1991,425.7,1.5
1992,,1.7
1993,357.6,2.3
1994,,
1995,352.0,4.0
1996,371.7,5.3
1997,397.0,6.6
1998,413.2,8.1
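On recent pandas versions (1.3 and later) the same behaviour can be requested with `on_bad_lines='skip'`. The sketch below assumes such a version and uses a few lines copied from the file, with shortened column names, as in-memory data.

```python
import io

import pandas as pd

# In-memory stand-in for the file: line 3 has 6 fields instead of 3
raw = (
    "Years,Total,Selective\n"
    "1989,no-data,no-data\n"
    "1990,480,78.2,1990,,\n"
    "1991,425.7,1.5\n"
    "1992,999999,1.7\n"
)

df_demo = pd.read_csv(
    io.StringIO(raw),
    on_bad_lines="skip",              # drop lines with too many fields
    index_col=0,                      # use the first column (Years) as index
    na_values=["no-data", "999999"],  # treat these markers as missing values
)
print(df_demo)
```

The 1990 line is dropped, and the `no-data` and `999999` markers become `NaN`.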


### 3.2 Open and read the files  

We can open the file and inspect the lines that were skipped by `pd.read_csv()`.  

**Option 1** (we get the **lines as strings**): open the file and use the `readline()` method of the file object [(see The Python Tutorial - 7.2.1 Methods of File Objects)](https://docs.python.org/3.7/tutorial/inputoutput.html#methods-of-file-objects).  


In [7]:
# create a file object f
f = open(file_path, 'r')

# read a line
line = f.readline()

# print the content of the line
print(line)

# closing the file
f.close()

Years,Urban waste collection per inhabitant (kg/inhabitant),Selective urban waste collection per inhabitant (kg/inhabitant)



If you want to **read all the lines** of a file into a list, you can use **`list(f)`** or **`f.readlines()`**. Each line keeps its trailing `\n`, which is why `print()` shows a blank line after each one.

In [8]:
# create a file object f
f = open(file_path, 'r')

# create a list of lines
lines_list = f.readlines()

# print the content of specific lines
for i in range(0,5):
    print(lines_list[i])

# closing the file
f.close()

Years,Urban waste collection per inhabitant (kg/inhabitant),Selective urban waste collection per inhabitant (kg/inhabitant)

1989,no-data,no-data

1990,480,78.2,1990,,

1991,425.7,1.5

1992,999999,1.7



In [9]:
# create a file object f
f = open(file_path, 'r')

# create a list of lines
lines_list = list(f)

# print the content of specific lines
print(lines_list[0:5])

# closing the file
f.close()

['Years,Urban waste collection per inhabitant (kg/inhabitant),Selective urban waste collection per inhabitant (kg/inhabitant)\n', '1989,no-data,no-data\n', '1990,480,78.2,1990,,\n', '1991,425.7,1.5\n', '1992,999999,1.7\n']
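The cells above close the file by hand with `f.close()`. If an error occurs between `open()` and `close()`, the file stays open. A `with` block closes the file automatically when the block ends. A minimal sketch, using a temporary demo file (hypothetical, with shortened column names) in place of the real CSV:

```python
import os
import tempfile

# Hypothetical demo file standing in for the notebook's CSV
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, encoding="utf-8") as tmp:
    tmp.write("Years,Total,Selective\n1989,no-data,no-data\n1990,480,78.2,1990,,\n")
    demo_path = tmp.name

# The file is closed automatically when the with-block ends, even if an error is raised inside it
with open(demo_path, "r", encoding="utf-8") as f:
    # strip the trailing newline that every line carries
    first_lines = [line.rstrip("\n") for line in f]

file_is_closed = f.closed
os.remove(demo_path)

print(first_lines)
print(file_is_closed)
```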


**Option 2** (we get a **list of lists**): use the **reader** function of Python's **csv** module, which can read a CSV file into a list of lists. [(see csv library documentation)](https://docs.python.org/3/library/csv.html)

In [10]:
import csv

In [11]:
# create a file object f
f = open(file_path, 'r')

# read the csv into a list of lists
csv_list = list(csv.reader(f))

# these five lists correspond to the first five lines of the file
print(csv_list[:5])

[['Years', 'Urban waste collection per inhabitant (kg/inhabitant)', 'Selective urban waste collection per inhabitant (kg/inhabitant)'], ['1989', 'no-data', 'no-data'], ['1990', '480', '78.2', '1990', '', ''], ['1991', '425.7', '1.5'], ['1992', '999999', '1.7']]


In [12]:
# keep at most n_elements items in each inner list (extra fields are dropped)
n_elements = 3
csv_list_clean = [i[:n_elements] for i in csv_list]
csv_list_clean[:5]

[['Years',
  'Urban waste collection per inhabitant (kg/inhabitant)',
  'Selective urban waste collection per inhabitant (kg/inhabitant)'],
 ['1989', 'no-data', 'no-data'],
 ['1990', '480', '78.2'],
 ['1991', '425.7', '1.5'],
 ['1992', '999999', '1.7']]

In [13]:
# we finally create a DataFrame, using the first list as the column names, and the other lists as data
df = pd.DataFrame(csv_list_clean[1:], columns=csv_list_clean[0])
df.head()

Unnamed: 0,Years,Urban waste collection per inhabitant (kg/inhabitant),Selective urban waste collection per inhabitant (kg/inhabitant)
0,1989,no-data,no-data
1,1990,480,78.2
2,1991,425.7,1.5
3,1992,999999,1.7
4,1993,357.6,2.3


In [14]:
# closing the file
f.close()
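The same `csv.reader` output can also be used to locate the bad lines before deciding how to repair them. The sketch below flags every row whose number of fields differs from the header's. It uses a few lines copied from the file, with shortened column names, and it assumes no quoted field spans several lines, so that row positions match line numbers.

```python
import csv
import io

# In-memory stand-in for the file: line 3 has 6 fields instead of 3
raw = (
    "Years,Total,Selective\n"
    "1989,no-data,no-data\n"
    "1990,480,78.2,1990,,\n"
    "1991,425.7,1.5\n"
)

rows = list(csv.reader(io.StringIO(raw)))
expected_n_fields = len(rows[0])  # the header defines the expected width

# Collect (line number, row) pairs for rows with an unexpected number of fields
bad_lines = [
    (line_no, row)
    for line_no, row in enumerate(rows, start=1)
    if len(row) != expected_n_fields
]

for line_no, row in bad_lines:
    print(f"line {line_no}: expected {expected_n_fields} fields, saw {len(row)}")
```

The line numbers are 1-based, like the ones in the `Skipping line ...` messages printed by pandas.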

## 4. Dealing with big files

**Reading only the first n lines**: `pd.read_csv(file, nrows=3)`

**Reading n random lines**  

1. Find the number of lines in the file: `n_total_rows`.
2. Find the number of rows to be skipped: `n_rows_to_skip = n_total_rows - n_rows_to_read`.
3. Sample the rows to be skipped into a list: `list_of_rows_to_skip = random.sample(range(1, n_total_rows-1), n_rows_to_skip)`.
4. Use `pd.read_csv()` with the argument `skiprows=list_of_rows_to_skip`.

In [22]:
import random

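# raw string (r"...") so that the backslashes in the Windows path are not treated as escape characters
file_path = r".\BLU01 - Messy Data\data\pokemons\pokemons.csv"

# 1 - Find the number of lines in the file (this count includes the header line)
n_total_rows = ! findstr /R /N "^" ".\BLU01 - Messy Data\data\pokemons\pokemons.csv" | find /C ":"
n_total_rows = int(n_total_rows[0])

# 2 - Compute how many rows to skip
# (n_total_rows counts the header too, so this leaves 9 data rows, as in the output below)
n_rows_to_read = 10
n_rows_to_skip = n_total_rows - n_rows_to_read

# 3 - Sample the indices of the rows to skip
random.seed(42) # fixes the sample; remove this line to get a different sample on each run
list_of_rows_to_skip = random.sample(
    range(1, n_total_rows-1), # row indices 1 .. n_total_rows-2: the header (index 0) and the last row are never skipped
    n_rows_to_skip # the number of row indices to sample, i.e. to skip
)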

pd.read_csv( file_path , skiprows=list_of_rows_to_skip)

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,93,Dodrio,Normal,Flying,60,110,70,60,60,100,1,False
1,151,Omastar,Rock,Water,70,60,125,115,70,55,1,False
2,213,Umbreon,Dark,,95,65,110,60,130,65,2,False
3,248,Houndoom,Dark,Fire,75,90,50,110,80,95,2,False
4,311,Breloom,Grass,Fighting,60,130,80,60,60,70,3,False
5,509,Mantyke,Water,Flying,45,20,50,60,120,50,4,False
6,543,Heatran,Fire,Steel,91,90,106,130,106,77,4,True
7,549,Manaphy,Water,,100,100,100,100,100,100,4,False
8,800,Volcanion,Fire,Water,80,110,120,130,90,70,6,True
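In the cell above, `n_total_rows` counts the header line and the sampled range stops at `n_total_rows-2`. The result therefore holds 9 rows rather than 10, and the last row of the file can never be skipped. Below is a minimal sketch of a variant that reads exactly `n_rows_to_read` data rows, assuming a single header line. The in-memory data and the names `n_total_lines` and `n_data_rows` are made up for the illustration.

```python
import io
import random

import pandas as pd

# Hypothetical in-memory file: 1 header line + 20 data rows
raw = "id,value\n" + "".join(f"{i},{i * 10}\n" for i in range(1, 21))

n_total_lines = raw.count("\n")      # 21 lines, header included
n_data_rows = n_total_lines - 1      # 20 data rows
n_rows_to_read = 5
n_rows_to_skip = n_data_rows - n_rows_to_read

random.seed(42)  # only to make the sample reproducible
rows_to_skip = random.sample(
    range(1, n_total_lines),  # line indices 1 .. n_total_lines-1: every data row, never the header
    n_rows_to_skip,
)

sample_df = pd.read_csv(io.StringIO(raw), skiprows=rows_to_skip)
print(sample_df)
```

Subtracting the header before computing `n_rows_to_skip`, and letting the range run up to the last line index, gives every data row the same chance of being kept.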


**Reading a random percentage of lines**  

1. Provide a lambda function to the `skiprows` argument of `read_csv()`.
2. This function is evaluated against each row index (index 0 is the header line).
3. A row is skipped if the function returns `True` for its index.

In [28]:
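# raw string (r"...") so that the backslashes in the Windows path are not treated as escape characters
file_path = r".\BLU01 - Messy Data\data\pokemons\pokemons.csv"

# keep each line with probability p, i.e. sample about 1% of the lines
# note: the header line (index 0) is sampled like any other line
p = 0.01

random.seed(42)
pd.read_csv( file_path, skiprows = lambda row_index: random.random() > p )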

Unnamed: 0,19,Beedrill,Bug,Poison,65,90,40,45,80,75,1,False
0,124,Kangaskhan,Normal,,105,95,80,40,80,90,1,False
1,269,Mega Tyranitar,Rock,Dark,100,164,150,95,120,71,2,False
2,290,Silcoon,Bug,,50,35,55,25,25,15,3,False
3,297,Seedot,Grass,,40,40,50,30,30,30,3,False
4,368,Zangoose,Normal,,73,115,60,60,60,90,3,False
5,394,Mega Absol,Dark,,65,150,60,115,60,115,3,False
6,427,Mega Rayquaza,Dragon,Flying,105,180,100,180,100,115,3,True
7,481,Purugly,Normal,,71,82,64,64,59,112,4,False
8,684,Golett,Ground,Ghost,59,74,50,35,50,35,5,False
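Because the function is also evaluated for row index 0, the header line is sampled like any other line. In the output above it was skipped, so the first sampled row (Beedrill) was used as the column names. A sketch of a variant that always keeps the header, using made-up in-memory data:

```python
import io
import random

import pandas as pd

# Hypothetical in-memory file: 1 header line + 1000 data rows
raw = "id,value\n" + "".join(f"{i},{i * 10}\n" for i in range(1, 1001))

p = 0.05  # keep roughly 5% of the data rows

random.seed(42)  # only to make the sample reproducible
sample_df = pd.read_csv(
    io.StringIO(raw),
    # index 0 (the header) is never skipped; every other line is skipped with probability 1 - p
    skiprows=lambda row_index: row_index > 0 and random.random() > p,
)

print(sample_df.columns.tolist())
print(len(sample_df))
```

The number of rows returned varies around `p` times the number of data rows; it is not fixed in advance.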


**Loading file chunk by chunk using `chunksize`**  

When `chunksize` is passed to `read_csv()`, the return value is an iterator of type `TextFileReader` instead of a `DataFrame`. Each element it yields is itself a DataFrame with at most `chunksize` rows.  

In [65]:
file_path = r".\BLU01 - Messy Data\data\exercises\euribor_interest_rates.csv"

chunks_iter = pd.read_csv( file_path, sep = '|' , chunksize = 5)
chunks_iter

<pandas.io.parsers.TextFileReader at 0x4659f0cb38>

In [66]:
list(chunks_iter)

[  Euribor 3 months Euribor 6 months Euribor 12 months  Years
 0             3,34             3,52              3,88   1999
 1             4,86             4,83              4,75   2000
 2             3,29             3,26              3,34   2001
 3             2,87             2,80              2,75   2002
 4             2,12             2,17              2,31   2003,
   Euribor 3 months Euribor 6 months Euribor 12 months  Years
 5             2,16             2,22              2,36   2004
 6             2,49             2,64              2,84   2005
 7             3,73             3,85              4,03   2006
 8             4,68             4,71              4,75   2007
 9             2,89             2,97              3,05   2008,
    Euribor 3 months Euribor 6 months Euribor 12 months  Years
 10             0,70             0,99              1,25   2009
 11             1,01             1,23              1,51   2010
 12             1,36             1,62              1,95   2011
 1

**Note**: the iterator can be consumed only once. After a `for` loop has gone through all the chunks, the iterator is exhausted. To go through the data again, we need to create a new iterator or convert the iterator into a list, as above.

In [72]:
chunks_iter1 = pd.read_csv( file_path, sep = '|' , chunksize = 5, nrows=10)
chunks_iter2 = pd.read_csv( file_path, sep = '|' , chunksize = 5, nrows=10)

for data_chunk in chunks_iter1:
    print(data_chunk)

for data_chunk in chunks_iter2:
    print(type(data_chunk))

  Euribor 3 months Euribor 6 months Euribor 12 months  Years
0             3,34             3,52              3,88   1999
1             4,86             4,83              4,75   2000
2             3,29             3,26              3,34   2001
3             2,87             2,80              2,75   2002
4             2,12             2,17              2,31   2003
  Euribor 3 months Euribor 6 months Euribor 12 months  Years
5             2,16             2,22              2,36   2004
6             2,49             2,64              2,84   2005
7             3,73             3,85              4,03   2006
8             4,68             4,71              4,75   2007
9             2,89             2,97              3,05   2008
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>


In [84]:
chunks_iter = pd.read_csv( file_path, sep = '|' , chunksize = 5)

chunk_df_list = []
for data_chunk in chunks_iter:
    #data_chunk_filtered = data_chunk.loc[data_chunk['Years'] > 2011,:]
    data_chunk_filtered = data_chunk[data_chunk['Years'] > 2011]
    chunk_df_list.append(data_chunk_filtered)

final_pd = pd.concat(chunk_df_list, axis=0)
final_pd.head()

   Euribor 3 months Euribor 6 months Euribor 12 months  Years
13             0,19             0,32              0,54   2012
14             0,29             0,39              0,56   2013
15             0,08             0,17              0,33   2014
16            -0,13            -0,04              0,06   2015
17            -0,32            -0,22             -0,08   2016
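Chunks do not have to be collected and concatenated. When only a summary is needed, it can be updated chunk by chunk, so that no more than one chunk is held in memory at a time. A minimal sketch with made-up pipe-separated data; the column name `Rate` and the summary variables are hypothetical:

```python
import io

import pandas as pd

# Hypothetical pipe-separated data: 10 rows, years 2000..2009
raw = "Years|Rate\n" + "".join(f"{2000 + i}|{i}\n" for i in range(10))

total_rows = 0
max_rate = None

# Each chunk is a DataFrame with at most 4 rows
for chunk in pd.read_csv(io.StringIO(raw), sep="|", chunksize=4):
    total_rows += len(chunk)
    chunk_max = chunk["Rate"].max()
    # update the running maximum with this chunk's maximum
    max_rate = chunk_max if max_rate is None else max(max_rate, chunk_max)

print(total_rows, max_rate)
```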
