Pandas and Exploratory Data Analysis

**Pandas Series**


Series are one of the two main data structures offered by pandas (the other is the DataFrame) that make our lives much easier as data scientists.

Series are a kind of hybrid between lists and dictionaries.

In [2]:
import pandas as pd

#Pandas is the "King of Data Analysis"

One big difference from lists is that each element in a Series has an associated index, which does not have to be a sequence of integers as in lists. In this respect, Series are similar to dictionaries. If we do not specify an index, pandas assigns the integers 0, 1, 2, ... by default:


In [None]:
serie_1 = pd.Series([3, 7, 9, 8])
serie_1

The left column is our index, the right column is the data stored in the Series. The text at the bottom is the data type we have in our Series.
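
The three parts described above can also be inspected one by one. The following is a minimal sketch using the standard `index`, `values`, and `dtype` attributes of a Series; the variable name `example` is only illustrative and is not part of the notebook.

```python
import pandas as pd

# Illustrative Series with the default integer index
example = pd.Series([3, 7, 9, 8])

# The left column of the printed output: the index
print(list(example.index))       # [0, 1, 2, 3]

# The right column: the stored data, as a NumPy array
print(example.values.tolist())   # [3, 7, 9, 8]

# The text at the bottom: the data type of the elements
print(example.dtype)
```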


We can create Series with a custom index:


In [None]:
serie_2 = pd.Series([4, 7, 9, 8], index=[10, 11, 12, 13])

serie_2

We can even use strings in the index:


In [None]:
serie_3 = pd.Series([5, 8, 7, 2], index=['a', 'b', 'c', 'd'])

serie_3

Due to their similarity, we can even create Series using dictionaries:


In [None]:
data = {
    "John": 45,
    "Mark": 56,
    "Tony": 12,
    "Jenny": 49,
    "Frame P.": 12
}

serie_4 = pd.Series(data)

serie_4

Just like in lists, we can access our data using the indexing operator. The difference is that with a Series we use the loc accessor to make explicit that we are selecting by index label rather than by position:


In [None]:
serie_1.loc[2]


In [None]:
serie_2.loc[12]


In [None]:
serie_3.loc['c']


In [None]:
serie_4.loc['Jenny']
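
To make the difference between labels and positions concrete, here is a small sketch that compares `loc` with its position-based counterpart `iloc`. The notebook itself only uses `loc`; `iloc` is brought in here purely for contrast, and the name `s` is illustrative.

```python
import pandas as pd

# Same data as serie_2: the labels are 10..13, the positions are 0..3
s = pd.Series([4, 7, 9, 8], index=[10, 11, 12, 13])

by_label = s.loc[12]      # the element whose index label is 12
by_position = s.iloc[2]   # the third element, counting from 0

print(by_label)      # 9
print(by_position)   # 9
print(s.iloc[0])     # 4
```

Both calls return the same element here because label 12 happens to sit at position 2.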


**DataFrames**

DataFrames are two-dimensional data structures: they have rows and columns. There are countless ways to create DataFrames. We are going to learn one of them: dictionaries of lists.

Here we have a dictionary of lists:

In [1]:
data = {
    'column_1': [1,2,3,4,5],
    'column_2': [6,7,8,9,10],
    'column_3': [11,12,13,14,15],
    'column_4': [16,17,18,19,20]}

Let's convert it to a DataFrame:


In [3]:
df = pd.DataFrame(data)

df



   column_1  column_2  column_3  column_4
0         1         6        11        16
1         2         7        12        17
2         3         8        13        18
3         4         9        14        19
4         5        10        15        20


As with Series, we can give a DataFrame a custom index:

In [None]:
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e'])

df


To look at individual columns, we use the indexing operator and pass it the column name:

In [4]:
df['column_1']


0    1
1    2
2    3
3    4
4    5
Name: column_1, dtype: int64


The column we got is a pandas Series whose name attribute is the column name.
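
A quick way to confirm this is to check the type and the name of the selected column. This is a small sketch with an illustrative DataFrame called `frame`:

```python
import pandas as pd

frame = pd.DataFrame({'column_1': [1, 2, 3], 'column_2': [4, 5, 6]})

column = frame['column_1']

print(type(column).__name__)              # Series
print(column.name)                        # column_1
# The column keeps the same index as the DataFrame it came from
print(column.index.equals(frame.index))   # True
```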

We can also see more than one column by passing a list with the names of the columns we want in the order we want them:

In [None]:
df[['column_3', 'column_1']]

We say "look at" or "see" because selecting columns does not always give us an independent copy of the data: depending on the operation and the pandas version, the result may be only a "view" of those columns, as if we were looking at them through a window.
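
When we need an object that is guaranteed to be independent of the original, one safe option is to ask for a copy explicitly. The following is a minimal sketch using `DataFrame.copy`; the names `frame` and `independent` are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'column_1': [1, 2, 3], 'column_2': [4, 5, 6]})

# Explicit copy: changes made to it never reach the original DataFrame
independent = frame[['column_1']].copy()
independent.loc[0, 'column_1'] = 100

print(independent.loc[0, 'column_1'])   # 100
print(frame.loc[0, 'column_1'])         # 1
```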


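To select rows by their index labels, we use loc. We can pass a single label, a list of labels, a slice of labels, or row labels together with column labels:

In [None]:
df.loc['a']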

In [None]:
df.loc[['c', 'a']]


In [None]:
df.loc['b':]


In [None]:
df.loc[['e', 'c'], ['column_4', 'column_2']]
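
The four selections above return different kinds of objects. The sketch below uses a smaller illustrative DataFrame (`frame`) to show what to expect from each form. Note that label slices with `loc` include both endpoints, unlike list slices.

```python
import pandas as pd

frame = pd.DataFrame(
    {'column_1': [1, 2, 3, 4, 5], 'column_2': [6, 7, 8, 9, 10]},
    index=['a', 'b', 'c', 'd', 'e'],
)

single_row = frame.loc['a']             # one label: a Series
several_rows = frame.loc[['c', 'a']]    # list of labels: a DataFrame, in the order given
label_slice = frame.loc['b':'d']        # label slice: 'b', 'c' and 'd' are all included
cell = frame.loc['e', 'column_2']       # row label and column label: a single value

print(type(single_row).__name__)    # Series
print(list(several_rows.index))     # ['c', 'a']
print(list(label_slice.index))      # ['b', 'c', 'd']
print(cell)                         # 10
```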


**JSON files**

One of the most common formats in which we are going to find datasets is JSON. As you probably already know, JSON is quite similar to the format of Python dictionaries.
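
To make the similarity concrete, here is a small sketch that parses a JSON string with `json.loads`. The record is made up for illustration and is not taken from any dataset used in this notebook.

```python
import json

# Made-up JSON text; note the lowercase true and null
json_text = '{"name": "Cafe Uno", "rating": 4.5, "open": true, "tags": ["coffee", "brunch"], "website": null}'

record = json.loads(json_text)

print(type(record).__name__)   # dict
print(record["rating"])        # 4.5
print(record["open"])          # True  (JSON true becomes Python True)
print(record["website"])       # None  (JSON null becomes Python None)
print(record["tags"])          # ['coffee', 'brunch']
```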


We are going to learn how to read JSON files and convert them to DataFrames.


In [5]:
import json


In [10]:
from google.colab import files

# Upload the file from our computer (this only works inside Google Colab)
uploaded = files.upload()

Saving zomato_reviews-raw.json to zomato_reviews-raw (1).json


First we open the file in read mode:

In [15]:
f = open('zomato_reviews-raw.json', 'r')


We then convert our JSON file into a Python dictionary:


In [16]:
json_data = json.load(f)


Then we close our file:


In [17]:
f.close()


And finally we pass the dictionary to pandas.DataFrame.from_dict to create a DataFrame:


In [20]:
df = pd.DataFrame.from_dict(json_data)


In [None]:
df
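
The open, load, and close steps can also be written with a `with` block, which closes the file for us. The sketch below is self-contained: it first writes a small, made-up JSON file to a temporary folder, because the structure of the real file is not shown here. The file name `example.json` and the records are assumptions for illustration only.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical records, used only so that the example can run on its own
records = [
    {"restaurant": "A", "rating": 4.1},
    {"restaurant": "B", "rating": 3.7},
]

path = os.path.join(tempfile.mkdtemp(), "example.json")
with open(path, "w") as out_file:
    json.dump(records, out_file)

# The file is closed automatically when the with block ends
with open(path, "r") as in_file:
    loaded = json.load(in_file)

frame = pd.DataFrame(loaded)

print(in_file.closed)        # True
print(frame.shape)           # (2, 2)
print(list(frame.columns))   # ['restaurant', 'rating']
```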

**Concat with DataFrames**


Many times we will have Series or DataFrames that we want to join into a single structure. For that we can use pd.concat. Concatenating Series vertically (axis=0) stacks them one after the other:


In [None]:
serie_1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
serie_2 = pd.Series([4, 5, 6], index=['d', 'e', 'f'])

In [None]:
pd.concat([serie_1, serie_2], axis=0)


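We can also concatenate horizontally (axis=1). Because these two Series have no index labels in common, the missing values are filled with NaN: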


In [None]:
pd.concat([serie_1, serie_2], axis=1)
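
When the two Series have no index labels in common, the horizontal result contains one row per label and fills the gaps with NaN. A minimal sketch that counts those gaps, with the illustrative names `left` and `right`:

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([4, 5, 6], index=['d', 'e', 'f'])

combined = pd.concat([left, right], axis=1)

# One row per distinct label, one column per Series
print(combined.shape)                     # (6, 2)
# Each column is missing the three labels it does not have
print(int(combined.isna().sum().sum()))   # 6
```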


We can name our columns to identify each one:

In [None]:
pd.concat([serie_1, serie_2], axis=1, keys=['serie_1', 'serie_2'])


If the Series share the same index, horizontal concatenation lines the values up row by row and no NaNs appear:


In [None]:
serie_3 = pd.Series([7, 8, 9], index=['a', 'b', 'c'])

pd.concat([serie_1, serie_3], axis=1, keys=['serie_1', 'serie_3'])

If we vertically concatenate two Series that share the same index, the index labels are repeated and we can no longer tell which Series each value came from:


In [None]:
pd.concat([serie_1, serie_3], axis=0)
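
The repeated labels can be checked directly. The sketch below uses illustrative names (`first`, `second`) and also shows `ignore_index=True`, a `pd.concat` option that discards the original labels; it is mentioned here only as one possible alternative.

```python
import pandas as pd

first = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
second = pd.Series([7, 8, 9], index=['a', 'b', 'c'])

stacked = pd.concat([first, second], axis=0)

# Every label now appears twice
print(stacked.index.is_unique)     # False
# Selecting one label returns both values
print(stacked.loc['a'].tolist())   # [1, 7]

# Alternative: drop the old labels and renumber from 0
renumbered = pd.concat([first, second], axis=0, ignore_index=True)
print(list(renumbered.index))      # [0, 1, 2, 3, 4, 5]
```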


Sometimes we want this, but when we don't, we can add a second index level to tell them apart:


In [None]:
pd.concat([serie_1, serie_3], axis=0, keys=['serie_1', 'serie_3'])


This is called a MultiIndex. We can access a MultiIndex at only one level or at both:


In [None]:
series_concat = pd.concat([serie_1, serie_3], axis=0, keys=['serie_1', 'serie_3'])


In [None]:
series_concat.loc['serie_1']


In [None]:
series_concat.loc[('serie_1', 'b')]
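
The sketch below summarises what each kind of access returns on a two-level index. It also uses `xs` to select by the inner level alone. The notebook does not cover `xs`; it is shown only as a possible complement, and all variable names are illustrative.

```python
import pandas as pd

first = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
second = pd.Series([7, 8, 9], index=['a', 'b', 'c'])

combined = pd.concat([first, second], axis=0, keys=['first', 'second'])

print(combined.index.nlevels)            # 2

# Outer level only: the whole original Series comes back
outer = combined.loc['first']
print(outer.tolist())                    # [1, 2, 3]

# Both levels: a single value
value = combined.loc[('second', 'b')]
print(value)                             # 8

# Inner level only: the 'b' entry of each Series
inner = combined.xs('b', level=1)
print(inner.tolist())                    # [2, 8]
```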


The same concatenation principles apply to both Series and DataFrames. Let's see them in action with DataFrames so that everything is clear.


In [None]:
data_1 = {
    'column_1': [1, 2, 3],
    'column_2': [4, 5, 6]
}

df_1 = pd.DataFrame(data_1, index=['a', 'b', 'c'])

df_1

In [None]:
data_2 = {
    'column_1': [7, 8, 9],
    'column_2': [10, 11, 12]
}

df_2 = pd.DataFrame(data_2, index=['d', 'e', 'f'])

df_2

We can join them vertically:


In [None]:
pd.concat([df_1, df_2], axis=0)


Or horizontally. Since the indices do not overlap, the gaps in the result are filled with NaNs:


In [None]:
pd.concat([df_1, df_2], axis=1)


If they have the same index, we avoid NaNs:


In [None]:
data_3 = {
    'column_3': [7, 8, 9],
    'column_4': [10, 11, 12]
}

df_3 = pd.DataFrame(data_3, index=['a', 'b', 'c'])

df_3

In [None]:
pd.concat([df_1, df_3], axis=1)
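
The effect of the index on horizontal concatenation can be checked by counting missing values. The following sketch uses small illustrative DataFrames (`df_a`, `df_b`, `df_c`) rather than the ones defined above.

```python
import pandas as pd

df_a = pd.DataFrame({'column_1': [1, 2, 3]}, index=['a', 'b', 'c'])
df_b = pd.DataFrame({'column_3': [7, 8, 9]}, index=['a', 'b', 'c'])   # same index as df_a
df_c = pd.DataFrame({'column_3': [7, 8, 9]}, index=['d', 'e', 'f'])   # different index

aligned = pd.concat([df_a, df_b], axis=1)
misaligned = pd.concat([df_a, df_c], axis=1)

print(aligned.shape)                         # (3, 2)
print(int(aligned.isna().sum().sum()))       # 0

print(misaligned.shape)                      # (6, 2)
print(int(misaligned.isna().sum().sum()))    # 6
```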


If we concatenate vertically with the same index, we cannot tell the rows of one DataFrame from the rows of the other:


In [None]:
data_4 = {
    'column_1': [7, 8, 9],
    'column_2': [10, 11, 12]
}

df_4 = pd.DataFrame(data_4, index=['a', 'b', 'c'])

df_4

In [None]:
pd.concat([df_1, df_4], axis=0)


We can add a MultiIndex:


In [None]:
df_concat = pd.concat([df_1, df_4], axis=0, keys=['df_1', 'df_4'])

df_concat

And we can access them in the same way:


In [None]:
df_concat.loc['df_1']


In [None]:
df_concat.loc[('df_1', 'b')]
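
With DataFrames, each level of access returns a different kind of object. A small sketch with illustrative names (`frame_1`, `frame_2`, `stacked`):

```python
import pandas as pd

frame_1 = pd.DataFrame({'column_1': [1, 2, 3], 'column_2': [4, 5, 6]},
                       index=['a', 'b', 'c'])
frame_2 = pd.DataFrame({'column_1': [7, 8, 9], 'column_2': [10, 11, 12]},
                       index=['a', 'b', 'c'])

stacked = pd.concat([frame_1, frame_2], axis=0, keys=['frame_1', 'frame_2'])

# Outer level only: the original DataFrame
block = stacked.loc['frame_1']
print(block.shape)                       # (3, 2)

# Both levels: one row, returned as a Series indexed by column name
row = stacked.loc[('frame_2', 'b')]
print(row.tolist())                      # [8, 11]

# Both levels plus a column: a single value
cell = stacked.loc[('frame_2', 'b'), 'column_2']
print(cell)                              # 11
```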


We can also join more than two DataFrames by adding them all to the list:


In [None]:
pd.concat([df_1, df_2, df_3, df_4], axis=1)
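
When several of the DataFrames share column names, the horizontal result contains repeated column labels. One possible way to keep them apart is to pass `keys`, just as in the vertical case. The sketch below assumes three small illustrative DataFrames (`one`, `two`, `three`).

```python
import pandas as pd

one = pd.DataFrame({'column_1': [1, 2]}, index=['a', 'b'])
two = pd.DataFrame({'column_1': [3, 4]}, index=['a', 'b'])
three = pd.DataFrame({'column_1': [5, 6]}, index=['a', 'b'])

# With axis=1, keys become the outer level of a column MultiIndex
wide = pd.concat([one, two, three], axis=1, keys=['one', 'two', 'three'])

print(wide.shape)                            # (2, 3)
print(wide.columns.nlevels)                  # 2
print(wide['two']['column_1'].tolist())      # [3, 4]
print(wide.loc['b', ('three', 'column_1')])  # 6
```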
