# Data Analysis with Pandas

## Building DataFrames

In [8]:
import pandas as pd

We can zip two lists into a list of key-value tuples and turn that list into a dictionary. Passing the dictionary to `pd.DataFrame()` then creates a DataFrame in which each key becomes a column.

In [17]:
list_keys = ['Key1', 'Key2']
list_values = [['Value1', 'Value2'], ['Value3', 'Value4']]
zipped = list(zip(list_keys, list_values)) # Zip the lists into one list of key-value tuples
data = dict(zipped) # Build a dictionary
print(data)

{'Key1': ['Value1', 'Value2'], 'Key2': ['Value3', 'Value4']}


In [18]:
df = pd.DataFrame(data) # Build a DataFrame from the dictionary
print(df)

     Key1    Key2
0  Value1  Value3
1  Value2  Value4
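The zip, dictionary, and DataFrame steps above can also be chained. The following is a minimal sketch that reuses the same lists; the name `df_short` is introduced here only for illustration.

```python
import pandas as pd

list_keys = ['Key1', 'Key2']
list_values = [['Value1', 'Value2'], ['Value3', 'Value4']]

# zip() pairs each key with its list of values; dict() consumes those pairs
# directly, so the intermediate list of tuples is not needed.
df_short = pd.DataFrame(dict(zip(list_keys, list_values)))

print(df_short.shape)
print(list(df_short.columns))
```

Keeping the intermediate `zipped` list, as in the cells above, is still handy when you want to inspect the key-value pairs before building the dictionary.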


New labels can be assigned to _columns_ in a DataFrame.

In [20]:
list_labels = ['year', 'temp']
df.columns = list_labels
print(df)

     year    temp
0  Value1  Value3
1  Value2  Value4
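Assigning to `df.columns` replaces all labels at once and modifies the DataFrame in place. As an alternative sketch, `DataFrame.rename()` maps selected old labels to new ones and returns a new DataFrame; `df_demo` and `df_renamed` are illustrative names, not part of the original notebook.

```python
import pandas as pd

df_demo = pd.DataFrame({'Key1': ['Value1', 'Value2'],
                        'Key2': ['Value3', 'Value4']})

# rename() takes an {old_label: new_label} mapping and returns a new
# DataFrame; df_demo keeps its original column labels.
df_renamed = df_demo.rename(columns={'Key1': 'year', 'Key2': 'temp'})

print(list(df_demo.columns))
print(list(df_renamed.columns))
```

The mapping form is less error-prone when only some columns need new names, because it does not depend on column order.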


## Importing and Exporting Data

The following examples use the iris dataset, stored in `iris.csv`.

In [40]:
file = 'iris.csv'
df = pd.read_csv(file)
print(df.shape)

(150, 5)


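It is also possible, and sometimes necessary, to read in the file while specifying the _header_ and _names_ parameters. Passing `header=0` together with `names` replaces the file's own header row with the supplied labels. Other parameters, such as _delimiter_ and _comment_, may be useful for tidying data.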

In [43]:
col_labels = ['sep_len', 'sep_wid', 'pet_len', 'pet_wid', 'species']
df2 = pd.read_csv(file, header=0, names=col_labels)
print(df2.head())

   sep_len  sep_wid  pet_len  pet_wid species
0      5.1      3.5      1.4      0.2  Setosa
1      4.9      3.0      1.4      0.2  Setosa
2      4.7      3.2      1.3      0.2  Setosa
3      4.6      3.1      1.5      0.2  Setosa
4      5.0      3.6      1.4      0.2  Setosa
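To illustrate the _delimiter_ and _comment_ parameters mentioned above, here is a minimal sketch. The inline text `raw` is made-up sample data standing in for a messy file (it is not the real `iris.csv`), and `df_messy` is an illustrative name.

```python
import io

import pandas as pd

# Hypothetical messy input: a comment line, no header row, ';' as separator.
raw = """# measurements exported for illustration
5.1;3.5;1.4;0.2;Setosa
4.9;3.0;1.4;0.2;Setosa
"""

col_labels = ['sep_len', 'sep_wid', 'pet_len', 'pet_wid', 'species']

# delimiter=';' splits fields on semicolons, comment='#' skips the comment
# line, and header=None tells pandas that the data has no header row.
df_messy = pd.read_csv(io.StringIO(raw), delimiter=';', comment='#',
                       header=None, names=col_labels)

print(df_messy)
```

Here `header=None` is used instead of `header=0` because the sample data has no header row to replace.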


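To save the DataFrame to a CSV file without the row index, pass `index=False` to `.to_csv()`:

In [None]:
# file_clean is expected to hold the output path; it is not defined in this notebook
df2.to_csv(file_clean, index=False)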
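The effect of `index=False` can be checked without touching the disk. This sketch writes to an in-memory buffer instead of a file; `df_out`, `buffer`, and `df_back` are illustrative names and the two rows are sample values only.

```python
import io

import pandas as pd

df_out = pd.DataFrame({'sep_len': [5.1, 4.9],
                       'species': ['Setosa', 'Setosa']})

# Write CSV text to an in-memory buffer; index=False leaves out the 0, 1, ...
# row labels, so the first line holds only the column names.
buffer = io.StringIO()
df_out.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back yields the same columns, with no extra index column.
df_back = pd.read_csv(io.StringIO(csv_text))
print(df_back.shape)
```

Without `index=False`, the row labels are written as an additional unnamed first column, which then shows up as an extra column when the file is read back.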

## Inspecting Data

We can use the `.head()` and `.tail()` methods to view the first and last few rows of a DataFrame. The `.info()` method reports the number of non-null values in each column; comparing that count with the total number of entries reveals missing data.

In [47]:
print(df.head())

   sepal.length  sepal.width  petal.length  petal.width variety
0           5.1          3.5           1.4          0.2  Setosa
1           4.9          3.0           1.4          0.2  Setosa
2           4.7          3.2           1.3          0.2  Setosa
3           4.6          3.1           1.5          0.2  Setosa
4           5.0          3.6           1.4          0.2  Setosa


In [48]:
print(df.tail())

     sepal.length  sepal.width  petal.length  petal.width    variety
145           6.7          3.0           5.2          2.3  Virginica
146           6.3          2.5           5.0          1.9  Virginica
147           6.5          3.0           5.2          2.0  Virginica
148           6.2          3.4           5.4          2.3  Virginica
149           5.9          3.0           5.1          1.8  Virginica


In [49]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal.length    150 non-null float64
sepal.width     150 non-null float64
petal.length    150 non-null float64
petal.width     150 non-null float64
variety         150 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB
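The iris data has no missing values, so every column reports 150 non-null entries. To see how missing data shows up in the counts, here is a minimal sketch on a small made-up DataFrame (`df_gaps` and its values are hypothetical, not taken from the iris file).

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in each column.
df_gaps = pd.DataFrame({'sep_len': [5.1, np.nan, 4.7],
                        'species': ['Setosa', 'Setosa', None]})

# isna() marks missing entries as True; sum() counts them per column.
missing_per_column = df_gaps.isna().sum()

# count() gives the non-null entries per column, the figure .info() reports.
non_null_per_column = df_gaps.count()

print(missing_per_column)
print(non_null_per_column)
```

For each column, the two counts add up to the number of rows.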


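The _pandas_ library is built on top of NumPy. We can use the `.values` attribute to represent a DataFrame as a NumPy array. Because this DataFrame mixes floats and strings, the resulting array has the generic `object` dtype.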

In [51]:
import numpy as np

np_values = df.values
print(np_values[:5])

[[5.1 3.5 1.4 0.2 'Setosa']
 [4.9 3.0 1.4 0.2 'Setosa']
 [4.7 3.2 1.3 0.2 'Setosa']
 [4.6 3.1 1.5 0.2 'Setosa']
 [5.0 3.6 1.4 0.2 'Setosa']]
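The pandas documentation recommends the `.to_numpy()` method over `.values` in newer code. The sketch below uses it on a small made-up DataFrame (`df_mixed` is an illustrative name) and shows that selecting only the numeric columns first keeps a numeric dtype.

```python
import pandas as pd

df_mixed = pd.DataFrame({'sep_len': [5.1, 4.9],
                         'species': ['Setosa', 'Setosa']})

# Mixed float and string columns have no common numeric type, so the whole
# array falls back to the generic object dtype.
all_values = df_mixed.to_numpy()

# Selecting the numeric column before converting keeps a float array.
numeric_values = df_mixed[['sep_len']].to_numpy()

print(all_values.dtype)
print(numeric_values.dtype)
```

A float array is usually what numerical routines expect, so dropping or encoding string columns before conversion is a common first step.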
