# Extracting data from a csv file

The first step in an ETL pipeline is extraction. Data comes in many different formats; the most common sources are csv files, JSON files, XML files, SQL databases, and the web. In this notebook, I will focus on reading csv files with pandas and on dealing with some common read errors.

# Part 1 reading errors

We use the pandas [read_csv method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) to read csv files, but sometimes it raises errors. Below are two examples of read errors and their solutions.

In [6]:
import pandas as pd

## ParserError

In [7]:
df_population = pd.read_csv('./data/population_data.csv')

ParserError: Error tokenizing data. C error: Expected 3 fields in line 5, saw 63


Reading the file raised a ParserError with the message 'Expected 3 fields in line 5, saw 63'. pandas inferred three columns from the first line of the file and then found 63 fields on line 5. To see what went wrong, print the first few lines of the data file.

Because read_csv fails on this file, use plain Python to read the lines instead.

In [9]:
# Print the first 10 lines of the raw file to inspect its layout
with open('./data/population_data.csv') as f:
    for i in range(10):
        line = f.readline()
        print('line: ', i, line)

line:  0 ﻿"Data Source","World Development Indicators",

line:  1 

line:  2 "Last Updated Date","2018-06-28",

line:  3 

line:  4 "Country Name","Country Code","Indicator Name","Indicator Code","1960","1961","1962","1963","1964","1965","1966","1967","1968","1969","1970","1971","1972","1973","1974","1975","1976","1977","1978","1979","1980","1981","1982","1983","1984","1985","1986","1987","1988","1989","1990","1991","1992","1993","1994","1995","1996","1997","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013","2014","2015","2016","2017",

line:  5 "Aruba","ABW","Population, total","SP.POP.TOTL","54211","55438","56225","56695","57032","57360","57715","58055","58386","58726","59063","59440","59840","60243","60528","60657","60586","60366","60103","59980","60096","60567","61345","62201","62836","63026","62644","61833","61079","61032","62149","64622","68235","72504","76700","80324","83200","85451","87277","89005","90853","92898","94
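The field counts behind the error message can also be checked directly. The sketch below is only an illustration: it assumes a small in-memory sample that mimics the layout printed above, and `sample` and `field_counts` are hypothetical names that are not part of the notebook or of pandas. It uses Python's built-in csv module to count the fields on each line.

```python
import csv
import io

# Hypothetical miniature of the file layout (not the real data file):
# two metadata lines, blank lines in between, then a wider header and data row.
sample = (
    '"Data Source","World Development Indicators",\n'
    '\n'
    '"Last Updated Date","2018-06-28",\n'
    '\n'
    '"Country Name","Country Code","1960","1961",\n'
    '"Aruba","ABW","54211","55438",\n'
)


def field_counts(text_stream, n_lines=10):
    """Return (line_index, number_of_fields) for the first n_lines lines."""
    counts = []
    for i, row in enumerate(csv.reader(text_stream)):
        if i >= n_lines:
            break
        # A blank line is parsed as an empty list, so its count is 0.
        counts.append((i, len(row)))
    return counts


counts = field_counts(io.StringIO(sample))
for line_index, n_fields in counts:
    print(line_index, n_fields)
```

In this sample the count jumps from 3 to 5 at index 4, which is the same kind of mismatch the parser reported. pandas numbers lines from 1 in its message, so its "line 5" corresponds to index 4 in the printout above. To run the same check on the real file, an open file handle could be passed instead of the `io.StringIO` object.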

The first four lines of the file (lines 0 to 3 above) hold metadata and blank lines rather than data; the column headers start on line 4. Read the file again with the read_csv method, this time passing the skiprows option to skip those four lines.

In [11]:
df_population = pd.read_csv('./data/population_data.csv', skiprows=4)
df_population.head()

| | Country Name | Country Code | Indicator Name | Indicator Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | ... | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | Unnamed: 62 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aruba | ABW | Population, total | SP.POP.TOTL | 54211.0 | 55438.0 | 56225.0 | 56695.0 | 57032.0 | 57360.0 | ... | 101453.0 | 101669.0 | 102053.0 | 102577.0 | 103187.0 | 103795.0 | 104341.0 | 104822.0 | 105264.0 | NaN |
| 1 | Afghanistan | AFG | Population, total | SP.POP.TOTL | 8996351.0 | 9166764.0 | 9345868.0 | 9533954.0 | 9731361.0 | 9938414.0 | ... | 28004331.0 | 28803167.0 | 29708599.0 | 30696958.0 | 31731688.0 | 32758020.0 | 33736494.0 | 34656032.0 | 35530081.0 | NaN |
| 2 | Angola | AGO | Population, total | SP.POP.TOTL | 5643182.0 | 5753024.0 | 5866061.0 | 5980417.0 | 6093321.0 | 6203299.0 | ... | 22549547.0 | 23369131.0 | 24218565.0 | 25096150.0 | 25998340.0 | 26920466.0 | 27859305.0 | 28813463.0 | 29784193.0 | NaN |
| 3 | Albania | ALB | Population, total | SP.POP.TOTL | 1608800.0 | 1659800.0 | 1711319.0 | 1762621.0 | 1814135.0 | 1864791.0 | ... | 2927519.0 | 2913021.0 | 2905195.0 | 2900401.0 | 2895092.0 | 2889104.0 | 2880703.0 | 2876101.0 | 2873457.0 | NaN |
| 4 | Andorra | AND | Population, total | SP.POP.TOTL | 13411.0 | 14375.0 | 15370.0 | 16412.0 | 17469.0 | 18549.0 | ... | 84462.0 | 84449.0 | 83751.0 | 82431.0 | 80788.0 | 79223.0 | 78014.0 | 77281.0 | 76965.0 | NaN |
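Both the failure and the skiprows fix can be reproduced without the data file. The following is a minimal sketch, assuming a tiny in-memory sample with the same shape of problem (metadata lines followed by a wider header); the `sample` string and the `df_sample` name are made up for illustration and are not part of the notebook.

```python
import io

import pandas as pd

# Hypothetical miniature of the file layout (not the real data file).
sample = (
    '"Data Source","World Development Indicators",\n'
    '\n'
    '"Last Updated Date","2018-06-28",\n'
    '\n'
    '"Country Name","Country Code","1960","1961",\n'
    '"Aruba","ABW","54211","55438",\n'
)

# Without skiprows, the first line is treated as the header (3 fields),
# so the wider lines further down cause a ParserError.
try:
    pd.read_csv(io.StringIO(sample))
    error_message = None
except pd.errors.ParserError as exc:
    error_message = str(exc)
print(error_message)

# Skipping the first four physical lines makes the real header the first
# line that pandas sees.
df_sample = pd.read_csv(io.StringIO(sample), skiprows=4)
print(df_sample.columns.tolist())
```

Each line in the sample ends with a trailing comma, so pandas sees one extra, empty field per row and gives it an automatic 'Unnamed: N' name. That is also why the real file ends up with an 'Unnamed: 62' column.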


# Part 2 basic exploration

In [14]:
# check data shape
df_population.shape

(264, 63)

In [12]:
# Count the number of null values in each column
df_population.isnull().sum()

Country Name        0
Country Code        0
Indicator Name      0
Indicator Code      0
1960                4
1961                4
1962                4
1963                4
1964                4
1965                4
1966                4
1967                4
1968                4
1969                4
1970                4
1971                4
1972                4
1973                4
1974                4
1975                4
1976                4
1977                4
1978                4
1979                4
1980                4
1981                4
1982                4
1983                4
1984                4
1985                4
                 ... 
1989                4
1990                2
1991                2
1992                3
1993                3
1994                3
1995                2
1996                2
1997                2
1998                1
1999                1
2000                1
2001                1
2002                1
2003      

The last column, 'Unnamed: 62', looks odd: it contains only NA values.
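pandas abbreviates long output, so a column that is entirely null can be easy to miss in the printed counts. One way to list such columns explicitly is sketched below; it assumes a small made-up frame (`df_demo`, with invented values) rather than the real population data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_population; the values are invented.
df_demo = pd.DataFrame({
    "Country Name": ["A", "B"],
    "1960": [100.0, np.nan],
    "Unnamed: 2": [np.nan, np.nan],
})

# Count nulls per column, then keep the columns where every row is null.
null_counts = df_demo.isnull().sum()
all_null_columns = null_counts[null_counts == len(df_demo)].index.tolist()
print(all_null_columns)
```

Applied to `df_population`, the same two lines would be expected to report 'Unnamed: 62', given the observation above that this column holds only NA values.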

In [13]:
# Count the number of null values in each row
df_population.isnull().sum(axis=1)

0      1
1      1
2      1
3      1
4      1
5      1
6      1
7      1
8      1
9      1
10     1
11     1
12     1
13     1
14     1
15     1
16     1
17     1
18     1
19     1
20     1
21     1
22     1
23     1
24     1
25     1
26     1
27     1
28     1
29     1
      ..
234    1
235    1
236    1
237    1
238    1
239    1
240    1
241    1
242    1
243    1
244    1
245    1
246    1
247    1
248    1
249    1
250    1
251    1
252    1
253    1
254    1
255    1
256    1
257    1
258    1
259    1
260    1
261    1
262    1
263    1
Length: 264, dtype: int64

Most rows have exactly one null value, which probably comes from the 'Unnamed: 62' column; that column holds no relevant information. Next, drop the 'Unnamed: 62' column from the data frame.
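The drop step could look like the sketch below. It runs on a small made-up frame (`df_demo`, with invented values and a column named 'Unnamed: 2') so that it is self-contained; for the notebook's data the same call would use `df_population` and 'Unnamed: 62'.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_population; the values are invented.
df_demo = pd.DataFrame({
    "Country Name": ["A", "B"],
    "1960": [100.0, np.nan],
    "Unnamed: 2": [np.nan, np.nan],
})

# Option 1: drop the column by name.
df_clean = df_demo.drop(columns=["Unnamed: 2"])

# Option 2: drop every column that contains only null values.
df_clean_alt = df_demo.dropna(axis=1, how="all")

# Re-check the null count per row after the drop.
remaining_nulls = df_clean.isnull().sum(axis=1)
print(df_clean.columns.tolist())
print(remaining_nulls.tolist())
```

Dropping by name is explicit about what is removed, while `dropna(axis=1, how="all")` would also remove any other fully empty column. Both calls return a new data frame and leave the original unchanged.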