# Unit 5 Lecture 1 - Loading Data

ESI4628: Decision Support Systems for Industrial Engineers<br>
University of Central Florida
Dr. Ivan Garibay, Ramya Akula, Mostafa Saeidi, Madeline Schiappa, and Brett Belcher. 
https://github.com/igaribay/DSSwithPython/blob/master/DSS-Week05/Notebook/DSS-Unit05-Lecture01.2018.ipynb

# Reading data in different formats:

- ```read_csv```

- ```read_table```

- ```read_excel```

- ```read_html```

In [2]:
import pandas as pd

## ```read_csv```
We use <code>read_csv</code> to create a pandas DataFrame from an external _Comma-Separated Values (CSV)_ formatted data file. For instance, the example below loads a CSV file called __housing_dataset.csv__ with this method.

In [9]:
csv_path = 'https://s3.amazonaws.com/dss-fall2018/housing_dataset.csv'
df = pd.read_csv (csv_path)
df.head()

Unnamed: 0,SalePrice,LotFrontage,LotArea,OverallQual,MasVnrArea,YearBuilt,BsmtUnfSF,YearRemodAdd,TotalBsmtSF,BsmtFinSF1,1stFlrSF
0,0.241078,0.150685,0.03342,0.666667,0.1225,0.949275,0.064212,0.883333,0.140098,0.125089,0.11978
1,0.203583,0.202055,0.038795,0.555556,0.0,0.753623,0.121575,0.433333,0.206547,0.173281,0.212942
2,0.261908,0.160959,0.046507,0.666667,0.10125,0.934783,0.185788,0.866667,0.150573,0.086109,0.134465
3,0.145952,0.133562,0.038561,0.666667,0.0,0.311594,0.231164,0.333333,0.123732,0.038271,0.143873
4,0.298709,0.215753,0.060576,0.777778,0.21875,0.927536,0.20976,0.833333,0.187398,0.116052,0.186095


One of the nice features of data-reading functions such as <code>read_csv</code> is _Type Inference_. This means that we do not have to specify which columns are numeric, strings, etc.

In [3]:
df.dtypes

SalePrice       float64
LotFrontage     float64
LotArea         float64
OverallQual     float64
MasVnrArea      float64
YearBuilt       float64
BsmtUnfSF       float64
YearRemodAdd    float64
TotalBsmtSF     float64
BsmtFinSF1      float64
1stFlrSF        float64
dtype: object
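A minimal sketch of type inference on data that mixes types. It assumes a small made-up CSV held in memory rather than the housing file; the column names and values are hypothetical, chosen only to give one integer, one float, and one text column.

```python
import io
import pandas as pd

# Hypothetical inline CSV (not the housing dataset)
csv_text = "rooms,price,city\n3,250000.5,Orlando\n4,310000.0,Tampa\n2,180000.25,Miami\n"

sample = pd.read_csv(io.StringIO(csv_text))

# pandas picks a dtype for each column without being told
print(sample.dtypes)

# dtype= overrides the inference for the columns you name
as_float = pd.read_csv(io.StringIO(csv_text), dtype={"rooms": "float64"})
print(as_float.dtypes)
```

The ```dtype``` argument is worth knowing for the cases where the inferred type is not the one you want.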

In [12]:
csv_path = 'tips.csv'
df2 = pd.read_csv (csv_path)
df2.head()

IOError: File tips.csv does not exist
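The error above means pandas could not find the file at the given path. A minimal sketch of guarding against that, assuming Python 3, where a missing local file surfaces as ```FileNotFoundError``` (a subclass of ```OSError```, of which ```IOError``` is an alias). The helper name ```load_csv_or_none``` and the file name are hypothetical.

```python
import pandas as pd

def load_csv_or_none(path):
    """Return a DataFrame, or None when the local file cannot be found."""
    try:
        return pd.read_csv(path)
    except FileNotFoundError:
        # Raised when the local path does not exist
        print(f"File {path} does not exist")
        return None

# Hypothetical file name that is assumed not to exist in the working directory
df2 = load_csv_or_none("tips_file_that_is_missing.csv")
print(df2 is None)
```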

## Missing Data


An important consideration when reading a file in any format is missing data. When pandas reads a file, it treats empty fields and markers such as NA or NULL as missing and stores them as ```NaN```.

A convenient way to check whether a ```DataFrame``` has any NaN values is the ```.isnull``` method.

In [11]:
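# csv_path is a string, so this checks the path itself, not the loaded data;
# pass the DataFrame instead (for example pd.isnull(df)) to check the data
pd.isnull(csv_path)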

False
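A small sketch of how missing entries look after loading, assuming a made-up CSV held in memory (the names and scores are hypothetical). An empty field, ```NA```, and ```NULL``` are all read as missing.

```python
import io
import pandas as pd

# Hypothetical inline CSV: an empty field, "NA" and "NULL" all mark missing entries
csv_text = "name,score,grade\nAnn,90,A\nBob,,NA\nCid,NULL,C\n"
scores = pd.read_csv(io.StringIO(csv_text))

print(scores.isnull())               # True wherever a value is missing
print(scores.isnull().values.any())  # one True/False answer for the whole DataFrame
```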

## ```read_table```

In [5]:
Text = pd.read_table ('https://s3.amazonaws.com/dss-fall2018/SampleTextFile.txt')
Text.head()

Unnamed: 0,"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus condimentum sagittis lacus, laoreet luctus ligula laoreet ut. Vestibulum ullamcorper accumsan velit vel vehicula. Proin tempor lacus arcu. Nunc at elit condimentum, semper nisi et, condimentum mi. In venenatis blandit nibh at sollicitudin. Vestibulum dapibus mauris at orci maximus pellentesque. Nullam id elementum ipsum. Suspendisse cursus lobortis viverra. Proin et erat at mauris tincidunt porttitor vitae ac dui."
0,"Donec vulputate lorem tortor, nec fermentum ni..."
1,"Nulla luctus sem sit amet nisi consequat, id o..."
2,Vestibulum ante ipsum primis in faucibus orci ...
3,"Etiam vitae accumsan augue. Ut urna orci, male..."
4,"Integer eu hendrerit diam, sed consectetur nun..."


In [6]:
# Read only the first 5 rows of the text file by passing nrows

Text = pd.read_table ('https://s3.amazonaws.com/dss-fall2018/SampleTextFile.txt', nrows = 5)
Text.head()

Unnamed: 0,"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus condimentum sagittis lacus, laoreet luctus ligula laoreet ut. Vestibulum ullamcorper accumsan velit vel vehicula. Proin tempor lacus arcu. Nunc at elit condimentum, semper nisi et, condimentum mi. In venenatis blandit nibh at sollicitudin. Vestibulum dapibus mauris at orci maximus pellentesque. Nullam id elementum ipsum. Suspendisse cursus lobortis viverra. Proin et erat at mauris tincidunt porttitor vitae ac dui."
0,"Donec vulputate lorem tortor, nec fermentum ni..."
1,"Nulla luctus sem sit amet nisi consequat, id o..."
2,Vestibulum ante ipsum primis in faucibus orci ...
3,"Etiam vitae accumsan augue. Ut urna orci, male..."
4,"Integer eu hendrerit diam, sed consectetur nun..."


In [7]:
pd.isnull(Text) 

Unnamed: 0,"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus condimentum sagittis lacus, laoreet luctus ligula laoreet ut. Vestibulum ullamcorper accumsan velit vel vehicula. Proin tempor lacus arcu. Nunc at elit condimentum, semper nisi et, condimentum mi. In venenatis blandit nibh at sollicitudin. Vestibulum dapibus mauris at orci maximus pellentesque. Nullam id elementum ipsum. Suspendisse cursus lobortis viverra. Proin et erat at mauris tincidunt porttitor vitae ac dui."
0,False
1,False
2,False
3,False
4,False


### Some more examples

#### Example 1:

In [8]:
pd.read_table('http://bit.ly/chiporders')


Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98
5,3,1,Chicken Bowl,"[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...",$10.98
6,3,1,Side of Chips,,$1.69
7,4,1,Steak Burrito,"[Tomatillo Red Chili Salsa, [Fajita Vegetables...",$11.75
8,4,1,Steak Soft Tacos,"[Tomatillo Green Chili Salsa, [Pinto Beans, Ch...",$9.25
9,5,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Black Beans, Pinto...",$9.25


As you can see in this example, the first row of the file is a header row, and pandas uses it for the column names.

In [9]:
# Show the first five rows
order = pd.read_table('http://bit.ly/chiporders')
order.head()

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98


#### Example 2:

In [10]:
pd.read_table('http://bit.ly/movieusers')

Unnamed: 0,1|24|M|technician|85711
0,2|53|F|other|94043
1,3|23|M|writer|32067
2,4|24|M|technician|43537
3,5|33|F|other|15213
4,6|42|M|executive|98101
5,7|57|M|administrator|91344
6,8|36|M|administrator|05201
7,9|29|M|student|01002
8,10|53|M|lawyer|90703
9,11|39|F|other|30329


As you can see in this example, the result is hard to read because everything ends up in one column. The fields in this file are separated by the ```|``` character, so you need to tell pandas which separator to use with the ```sep``` argument.

In [11]:
pd.read_table('http://bit.ly/movieusers', sep='|')

Unnamed: 0,1,24,M,technician,85711
0,2,53,F,other,94043
1,3,23,M,writer,32067
2,4,24,M,technician,43537
3,5,33,F,other,15213
4,6,42,M,executive,98101
5,7,57,M,administrator,91344
6,8,36,M,administrator,05201
7,9,29,M,student,01002
8,10,53,M,lawyer,90703
9,11,39,F,other,30329


The result looks better than before: each field is now in its own column.

The other issue is that the first row of the file is data, not a header row, so you need to add ```header = None```. If you want to define the column names yourself, create a Python list and pass it through ```names```, as below:

In [12]:
header_name = ['id','age','gender','job','zip_code']

pd.read_table('http://bit.ly/movieusers', sep='|', header = None, names = header_name)

Unnamed: 0,id,age,gender,job,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213
5,6,42,M,executive,98101
6,7,57,M,administrator,91344
7,8,36,M,administrator,05201
8,9,29,M,student,01002
9,10,53,M,lawyer,90703
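The same call can be tried offline. The sketch below assumes three made-up pipe-delimited rows held in memory. It also passes ```dtype```, which is an addition to the example above: when every zip code in a file is numeric, type inference reads the column as integers and drops leading zeros, so the sketch keeps it as text.

```python
import io
import pandas as pd

# Hypothetical pipe-delimited rows without a header line
raw = "1|24|M|technician|85711\n2|36|M|administrator|05201\n3|29|M|student|01002\n"
header_name = ["id", "age", "gender", "job", "zip_code"]

users = pd.read_table(
    io.StringIO(raw),
    sep="|",
    header=None,
    names=header_name,
    dtype={"zip_code": str},  # keep zip codes as text so leading zeros survive
)
print(users)
```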


# Data cleaning and preparation

Data cleaning is the process of removing or repairing bad data in a dataset. Bad data includes incorrect and improperly formatted data as well as duplicated and missing data.

In [13]:
# Example of a student survey dataset which includes incorrect and improperly formatted data.

csv_path = 'https://s3.amazonaws.com/dss-fall2018/Student_Survey.csv'

df = pd.read_csv (csv_path)
print (df)

     Year    Location      Education  Sample_Size Satisfactory
0  2017.0      Putnam  Middle School        659.0            Y
1  2018.0   Lexington  Middle School        649.0            N
2  2018.0   Lexington  Middle School        435.0            N
3  2017.0    Berkeley            NaN          NaN          NaN
4     NaN    Berkeley    High School        228.0            Y
5  2018.0    Berkeley  Middle School         20.0          NaN
6  2018.0  Washington    High School        437.0            N
7     NaN     Tremont    High School          NaN            Y
8  2016.0     Tremont    High School        220.0            Y


Let's take a look at the dataset. There are seven NA values, spread across every column except ```Location```. The ```isnull``` method finds each missing value and returns ```True``` in its position.

In [14]:
# Recognizing missing values

print (df.isnull())


    Year  Location  Education  Sample_Size  Satisfactory
0  False     False      False        False         False
1  False     False      False        False         False
2  False     False      False        False         False
3  False     False       True         True          True
4   True     False      False        False         False
5  False     False      False        False          True
6  False     False      False        False         False
7   True     False      False         True         False
8  False     False      False        False         False
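On a larger table it is easier to count the missing values than to read the ```True```/```False``` grid. A minimal sketch, with the survey table rebuilt by hand from the printout above so that it runs without the download:

```python
import numpy as np
import pandas as pd

# The survey table printed above, typed in by hand
survey = pd.DataFrame({
    "Year": [2017, 2018, 2018, 2017, np.nan, 2018, 2018, np.nan, 2016],
    "Location": ["Putnam", "Lexington", "Lexington", "Berkeley", "Berkeley",
                 "Berkeley", "Washington", "Tremont", "Tremont"],
    "Education": ["Middle School", "Middle School", "Middle School", np.nan,
                  "High School", "Middle School", "High School", "High School",
                  "High School"],
    "Sample_Size": [659, 649, 435, np.nan, 228, 20, 437, np.nan, 220],
    "Satisfactory": ["Y", "N", "N", np.nan, "Y", np.nan, "N", "Y", "Y"],
})

missing_per_column = survey.isnull().sum()  # each True counts as 1
print(missing_per_column)
print(missing_per_column.sum())             # total number of missing cells
```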


### ```dropna``` method

```dropna``` is a method that filters missing data out of a dataset. It is useful when you want to work only with complete records and omit the rest.

In [15]:
# Drop every row that contains at least one missing value

df.dropna()

Unnamed: 0,Year,Location,Education,Sample_Size,Satisfactory
0,2017.0,Putnam,Middle School,659.0,Y
1,2018.0,Lexington,Middle School,649.0,N
2,2018.0,Lexington,Middle School,435.0,N
6,2018.0,Washington,High School,437.0,N
8,2016.0,Tremont,High School,220.0,Y


In [16]:
# Drop rows or columns in which every value is NA (in this example, no row or column is entirely NA)

df.dropna(how = 'all')    #For rows
df.dropna(axis = 1, how ='all')     #For columns

Unnamed: 0,Year,Location,Education,Sample_Size,Satisfactory
0,2017.0,Putnam,Middle School,659.0,Y
1,2018.0,Lexington,Middle School,649.0,N
2,2018.0,Lexington,Middle School,435.0,N
3,2017.0,Berkeley,,,
4,,Berkeley,High School,228.0,Y
5,2018.0,Berkeley,Middle School,20.0,
6,2018.0,Washington,High School,437.0,N
7,,Tremont,High School,,Y
8,2016.0,Tremont,High School,220.0,Y
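Because the survey data has no all-NA row or column, ```how = 'all'``` leaves it unchanged. The sketch below uses a made-up frame where the option does have an effect. It also shows ```thresh``` and ```subset```, two further ```dropna``` arguments not covered above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one all-NaN row (index 2) and one all-NaN column ("c")
demo = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [5.0, 6.0, np.nan, 8.0],
    "c": [np.nan, np.nan, np.nan, np.nan],
})

rows_kept = demo.dropna(how="all")          # drops only row 2
cols_kept = demo.dropna(axis=1, how="all")  # drops only column "c"
enough_data = demo.dropna(thresh=2)         # keeps rows with at least 2 non-NaN values
has_a = demo.dropna(subset=["a"])           # drops rows where "a" is missing

print(rows_kept.index.tolist(), cols_kept.columns.tolist())
print(enough_data.index.tolist(), has_a.index.tolist())
```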


### ```fillna``` method 

```fillna``` is a method that fills missing data with a number or other value you choose.

In [17]:
# Fill missing data in the 'Sample_Size' column with 100.

print (df['Sample_Size'])
df['Sample_Size'].fillna(100)

0    659.0
1    649.0
2    435.0
3      NaN
4    228.0
5     20.0
6    437.0
7      NaN
8    220.0
Name: Sample_Size, dtype: float64


0    659.0
1    649.0
2    435.0
3    100.0
4    228.0
5     20.0
6    437.0
7    100.0
8    220.0
Name: Sample_Size, dtype: float64

In [18]:
# Fill missing data in the 'Year' column with 2015.

df.fillna({'Year': 2015})

Unnamed: 0,Year,Location,Education,Sample_Size,Satisfactory
0,2017.0,Putnam,Middle School,659.0,Y
1,2018.0,Lexington,Middle School,649.0,N
2,2018.0,Lexington,Middle School,435.0,N
3,2017.0,Berkeley,,,
4,2015.0,Berkeley,High School,228.0,Y
5,2018.0,Berkeley,Middle School,20.0,
6,2018.0,Washington,High School,437.0,N
7,2015.0,Tremont,High School,,Y
8,2016.0,Tremont,High School,220.0,Y


### Removing Duplicates

Sometimes a DataFrame contains duplicate rows that you need to remove.

To get a sample DataFrame with duplicate rows, the cell below builds a small one in which the last row (index 5) repeats an earlier row (index 1):

In [19]:
# Create a DataFrame with two identical rows

raw_data = [['Jason','Miller',42,4,25],['Molly','Jacobson',52,24,94],['Tina','Alison',36,31,57],['Jake','Milner',24,2,62],
            ['Amy','Cooze',73,3,70],['Molly','Jacobson',52,24,94]]

df = pd.DataFrame (raw_data, columns = ['first_name', 'last_name','age','preTestScore','postTestScore'])
print (df)

  first_name last_name  age  preTestScore  postTestScore
0      Jason    Miller   42             4             25
1      Molly  Jacobson   52            24             94
2       Tina    Alison   36            31             57
3       Jake    Milner   24             2             62
4        Amy     Cooze   73             3             70
5      Molly  Jacobson   52            24             94


The ```duplicated()``` method returns a boolean Series that indicates, for each row, whether it repeats an earlier row. In this DataFrame, the value at index 5 is ```True```, which means row 5 duplicates a row that appeared before it.

In [20]:
df.duplicated()

0    False
1    False
2    False
3    False
4    False
5     True
dtype: bool

The ```drop_duplicates()``` method returns a DataFrame with the duplicate rows removed.

In [21]:
df.drop_duplicates()

Unnamed: 0,first_name,last_name,age,preTestScore,postTestScore
0,Jason,Miller,42,4,25
1,Molly,Jacobson,52,24,94
2,Tina,Alison,36,31,57
3,Jake,Milner,24,2,62
4,Amy,Cooze,73,3,70
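By default both methods treat the first occurrence as the original and compare all columns. The sketch below, on a shortened version of the table above, shows the ```keep``` and ```subset``` arguments that change this behavior.

```python
import pandas as pd

# Shortened version of the table above: row 3 repeats row 1
people = pd.DataFrame(
    [["Jason", "Miller", 42], ["Molly", "Jacobson", 52], ["Tina", "Alison", 36],
     ["Molly", "Jacobson", 52]],
    columns=["first_name", "last_name", "age"],
)

both_marked = people.duplicated(keep=False)              # flags rows 1 and 3
keep_last = people.drop_duplicates(keep="last")          # keeps row 3, drops row 1
by_name = people.drop_duplicates(subset=["first_name"])  # compares first_name only

print(both_marked.tolist())
print(keep_last.index.tolist(), by_name.index.tolist())
```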


### Renaming axis indexes

The ```rename()``` method is useful when you want to create a new version of your dataset with different labels without changing the original dataset.

In [22]:
df = pd.DataFrame ({'A':[1,2,3], 'B':[4,5,6]})
print(df)

   A  B
0  1  4
1  2  5
2  3  6


In [23]:
df.rename(index = str, columns = str.lower)


Unnamed: 0,a,b
0,1,4
1,2,5
2,3,6


In [24]:
df.rename({0:'first',1:'second',2:'third'}, axis = 'index')

Unnamed: 0,A,B
first,1,4
second,2,5
third,3,6


In [25]:
df.rename(index = {0:'test1'}, columns = {'A':'sample1','B':'sample2'})

Unnamed: 0,sample1,sample2
test1,1,4
1,2,5
2,3,6


As you can see, each of these examples created a new version of the dataset, whereas the original version remains unchanged.

In [26]:
# check the original dataset

df

Unnamed: 0,A,B
0,1,4
1,2,5
2,3,6
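If you do want to keep the new labels, store the result of ```rename()```. A minimal sketch on the same two-column frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# rename() returns a new object; df itself keeps its labels
renamed = df.rename(columns={"A": "sample1", "B": "sample2"})
print(list(df.columns))
print(list(renamed.columns))

# Assign the result back to the same name to make the change stick
df = df.rename(columns=str.lower)
print(list(df.columns))
```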
