# DataFrame and Imports

In [29]:
# Load the necessary packages

import numpy as np
import pandas as pd

## 1. Dataframes

### 1.1 Multiple ways to create a dataframe

See the [`pandas.DataFrame` reference](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) for more ways to create a dataframe.

In [30]:
# Method 1: From raw data (a dict of columns; the scalar 1 is repeated for every row)

df_1 = pd.DataFrame(data = {'x': np.arange(1, 6, 1),
                           'y':1,
                           })
df_1

Unnamed: 0,x,y
0,1,1
1,2,1
2,3,1
3,4,1
4,5,1
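
Method 1 relies on pandas repeating the scalar given for `y` to match the length of `x`. The sketch below illustrates that behaviour. It also shows what happens when every value is a scalar: pandas cannot infer a length and asks for an index. The name `df_scalars` is illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# The array for 'x' fixes the number of rows; the scalar for 'y' is repeated to match.
df = pd.DataFrame({"x": np.arange(1, 6), "y": 1})
print(df["y"].tolist())

# With only scalars there is no length to infer, so pandas raises a ValueError ...
try:
    pd.DataFrame({"x": 1, "y": 2})
    raised = False
except ValueError:
    raised = True

# ... unless an index is supplied explicitly.
df_scalars = pd.DataFrame({"x": 1, "y": 2}, index=[0])
print(df_scalars.shape)
```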


In [31]:
# Method 2: When one column is computed from other columns
x = np.arange(1, 6, 1)
y = [1,1,1,1,1]
z = x*x+y

df_2 = pd.DataFrame(data = {'x': x,
                           'y':y,
                           'z':z})
df_2

Unnamed: 0,x,y,z
0,1,1,2
1,2,1,5
2,3,1,10
3,4,1,17
4,5,1,26


In [32]:
# Method 3: Use arrays
x = np.arange(1, 6, 1)
y = [1,1,1,1,1]
z = x*x+y

# .T transposes the 3x5 array so that each input array becomes a column
df_3 = pd.DataFrame(np.array([x, y, z]).T, columns=['x', 'y', 'z'])
df_3

Unnamed: 0,x,y,z
0,1,1,2
1,2,1,5
2,3,1,10
3,4,1,17
4,5,1,26
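
One thing to keep in mind with Method 3: a NumPy array has a single dtype, so stacking columns of different types converts them all to a common type. This makes no difference above because every column holds integers. The sketch below uses a made-up float column, `ratio`, to show the effect and compares it with the dict approach from Method 2.

```python
import numpy as np
import pandas as pd

x = np.arange(1, 6)
ratio = [0.5, 1.0, 1.5, 2.0, 2.5]  # hypothetical float column, not from the notebook

# Stacking into one array forces a common dtype, so the integers in x become floats.
stacked = pd.DataFrame(np.array([x, ratio]).T, columns=["x", "ratio"])

# A dict of columns lets each column keep its own dtype.
by_column = pd.DataFrame({"x": x, "ratio": ratio})

x_is_float_when_stacked = pd.api.types.is_float_dtype(stacked["x"])
x_is_int_by_column = pd.api.types.is_integer_dtype(by_column["x"])
print(x_is_float_when_stacked, x_is_int_by_column)
```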


**Get arrays from a DataFrame**: especially useful when working with neural networks

In [33]:
# Convert the dataframe into an array (you will lose the column names and index)
df_3.values

array([[ 1,  1,  2],
       [ 2,  1,  5],
       [ 3,  1, 10],
       [ 4,  1, 17],
       [ 5,  1, 26]])
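
The pandas documentation recommends `DataFrame.to_numpy()` over `.values` for this conversion. As a minimal sketch, the example below splits a dataframe like `df_3` into a feature array and a target array, which is a common step before training a model. The names `features`, `target`, and `column_names` are illustrative.

```python
import numpy as np
import pandas as pd

x = np.arange(1, 6)
df = pd.DataFrame({"x": x, "y": 1, "z": x * x + 1})

features = df[["x", "y"]].to_numpy()  # 2-D array, one row per record
target = df["z"].to_numpy()           # 1-D array

# The arrays carry no labels, so save the column names if they are needed later.
column_names = df.columns.tolist()

print(features.shape, target.shape)
```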

<hr style="border:1px solid black">

## 2. Data Import

pandas provides the following functions for loading files into dataframes. (*A flat file consists of rows, and each row is called a record.*)

* `read_table()` reads a general delimited file into a DataFrame. (*A delimited file stores data with one record per line, for example a single book or company, and the fields in each line are separated by a delimiter.*)

* `read_csv()` reads comma delimited files.

* `read_fwf()` reads fixed width files.

* `read_excel()` reads xlsx files.

* `read_json()` converts a JSON string to pandas object.

* `read_html()` converts an HTML table into a pandas DataFrame.

* `read_stata()` reads Stata file into DataFrame.

* `read_sas()` reads SAS file into DataFrame.

* `read_spss()` reads SPSS file into DataFrame.

* `read_sql()` reads SQL query or database table into a DataFrame.

* `read_hdf()` reads HDF files (Hierarchical Data Format, a standard for storing large numerical data).

See the [`read_csv` documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) for more details.
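
To make the difference between `read_csv()` and `read_table()` concrete, here is a small sketch that reads in-memory text instead of files. The sample data is made up. `read_table()` uses a tab as its default delimiter, and `read_csv()` can read the same text when given `sep="\t"`.

```python
import io
import pandas as pd

# Made-up sample data: the same two records, comma-delimited and tab-delimited.
comma_text = "name,score\nAda,90\nBob,85\n"
tab_text = "name\tscore\nAda\t90\nBob\t85\n"

from_csv = pd.read_csv(io.StringIO(comma_text))
from_table = pd.read_table(io.StringIO(tab_text))            # tab is the default separator
from_csv_sep = pd.read_csv(io.StringIO(tab_text), sep="\t")  # same result via read_csv

print(from_csv.equals(from_table), from_table.equals(from_csv_sep))
```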

<hr style="border:1px solid black">

### 2.1 CSV files

In [34]:
# Skip rows by their 0-based line number in the file

pd.read_csv("Data_3/hypo_1.csv", skiprows=[0,3])

Unnamed: 0,individual,sex,age,IQ,depression,health,weight
0,1,Male,21.0,120.0,Yes,Very good,150
1,3,Male,22.0,135.0,No,Average,135
2,4,Male,86.0,150.0,No,Very poor,140
3,5,Male,60.0,92.0,Yes,Good,110
4,6,Female,16.0,130.0,Yes,Good,110
5,7,Female,,150.0,Yes,Very good,120
6,8,Female,43.0,,Yes,Average,120
7,9,Female,22.0,84.0,No,Average,105
8,10,Female,80.0,70.0,No,Good,100


In [35]:
# Skip comments: anything following a '#' on a line is ignored

pd.read_csv("Data_3/hypo_1.csv", comment="#")

Unnamed: 0,individual,sex,age,IQ,depression,health,weight
0,1,Male,21.0,120.0,Yes,Very good,150
1,2,Male,43.0,,No,Very good,160
2,3,Male,22.0,135.0,No,Average,135
3,4,Male,86.0,150.0,No,Very poor,140
4,5,Male,60.0,92.0,Yes,Good,110
5,6,Female,16.0,130.0,Yes,Good,110
6,7,Female,,150.0,Yes,Very good,120
7,8,Female,43.0,,Yes,Average,120
8,9,Female,22.0,84.0,No,Average,105
9,10,Female,80.0,70.0,No,Good,100


In [36]:
# If the file has no column names, use header=None (this file has them, so they are read as data row 0)
pd.read_csv("Data_3/hypo_1.csv", comment = "#", header = None)

Unnamed: 0,0,1,2,3,4,5,6
0,individual,sex,age,IQ,depression,health,weight
1,1,Male,21,120,Yes,Very good,150
2,2,Male,43,,No,Very good,160
3,3,Male,22,135,No,Average,135
4,4,Male,86,150,No,Very poor,140
5,5,Male,60,92,Yes,Good,110
6,6,Female,16,130,Yes,Good,110
7,7,Female,,150,Yes,Very good,120
8,8,Female,43,,Yes,Average,120
9,9,Female,22,84,No,Average,105


In [37]:
# Adding column names with names= (the file's own header line is then read as data row 0)
pd.read_csv("Data_3/hypo_1.csv", comment = "#", names = ["a", "b", "c", "d", "e","f", "g"])

Unnamed: 0,a,b,c,d,e,f,g
0,individual,sex,age,IQ,depression,health,weight
1,1,Male,21,120,Yes,Very good,150
2,2,Male,43,,No,Very good,160
3,3,Male,22,135,No,Average,135
4,4,Male,86,150,No,Very poor,140
5,5,Male,60,92,Yes,Good,110
6,6,Female,16,130,Yes,Good,110
7,7,Female,,150,Yes,Very good,120
8,8,Female,43,,Yes,Average,120
9,9,Female,22,84,No,Average,105


In [38]:
# Missing values: treat 120 as missing (NaN) in every column
pd.read_csv("Data_3/hypo_1.csv", comment = "#", na_values=[120])

Unnamed: 0,individual,sex,age,IQ,depression,health,weight
0,1,Male,21.0,,Yes,Very good,150.0
1,2,Male,43.0,,No,Very good,160.0
2,3,Male,22.0,135.0,No,Average,135.0
3,4,Male,86.0,150.0,No,Very poor,140.0
4,5,Male,60.0,92.0,Yes,Good,110.0
5,6,Female,16.0,130.0,Yes,Good,110.0
6,7,Female,,150.0,Yes,Very good,
7,8,Female,43.0,,Yes,Average,
8,9,Female,22.0,84.0,No,Average,105.0
9,10,Female,80.0,70.0,No,Good,100.0
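
Note that a list passed to `na_values` applies to every column, which is why both `IQ` and `weight` lose their 120 entries above. To restrict the rule to one column, `na_values` also accepts a dict keyed by column name. A sketch on made-up data:

```python
import io
import pandas as pd

text = "id,IQ,weight\n1,120,150\n2,135,120\n3,92,110\n"  # made-up data

# List form: 120 is treated as missing in every column.
everywhere = pd.read_csv(io.StringIO(text), na_values=[120])

# Dict form: 120 is treated as missing only in the IQ column.
iq_only = pd.read_csv(io.StringIO(text), na_values={"IQ": [120]})

missing_everywhere = int(everywhere.isna().sum().sum())
missing_iq_only = int(iq_only.isna().sum().sum())
print(missing_everywhere, missing_iq_only)
```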


<hr style="border:1px solid black">

### 2.2 Excel files

In [39]:
# By default, the first sheet of the workbook is read
pd.read_excel("Data_3/datasets.xlsx")

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


In [40]:
pd.read_excel("Data_3/datasets.xlsx", sheet_name="chickwts")

Unnamed: 0,weight,feed
0,179,horsebean
1,160,horsebean
2,136,horsebean
3,227,horsebean
4,217,horsebean
...,...,...
66,359,casein
67,216,casein
68,222,casein
69,283,casein


<hr style="border:1px solid black">

### 2.3 Stata

In [41]:
#pd.read_stata("Data_3/stata.DTA") # It takes a while to load so this is commented out

<hr style="border:1px solid black">

### 2.4 SAS

In [42]:
#pd.read_sas("sas_2.SAS7BDAT") # It takes a while to load so this is commented out

<hr style="border:1px solid black">

## 3. Write dataframes to files

In [43]:
example = pd.read_csv("Data_3/hypo_1.csv", comment="#")

In [44]:
# Write to CSV (the row index is written as the first column by default)
example.to_csv("Data_3/example_1.csv")

In [45]:
# Write to Excel
example.to_excel("Data_3/example_2.xlsx")
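
By default `to_csv()` also writes the row index, so reading the file back produces an extra column named `Unnamed: 0`. Passing `index=False` avoids this. The sketch below shows the round trip in memory on a small made-up dataframe, using the fact that `to_csv()` returns a string when no path is given.

```python
import io
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "sex": ["Male", "Female"]})  # made-up data

with_index = df.to_csv()                # the index is written as an unnamed first column
without_index = df.to_csv(index=False)  # only the data columns are written

reread_with = pd.read_csv(io.StringIO(with_index))
reread_without = pd.read_csv(io.StringIO(without_index))

print(reread_with.columns.tolist())
print(reread_without.columns.tolist())
```

The same `index=False` argument is accepted by `to_excel()`.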