# B. Data Structures in Pandas

Pandas is a Python package that offers powerful, expressive, and flexible data structures and tools for data science. It makes data manipulation and analysis much easier than doing the same work without it.

In this lesson, we start with a quick overview of the two fundamental data structures in pandas, `Series` and `DataFrame`.


### _Objective_
1. **Series** : Understanding how to create Series objects with pandas 
2. **DataFrame** : Understanding how to create DataFrame objects with pandas 

Throughout the lesson, you'll be learning about pandas data structures using the following example data.

#### Example Data) Report cards

|Class   | Last Name | First Name| History | English | Math | Social Studies | Science |
|----| --- | ----   | --- |---| --- | --- | --- |
|1 | Smith | John |80 |92 |70 | 65 | 92 |
|1 | Schafer | Elise |91 |75 |90 | 68 | 85 | 
|2 | Zimmermann | Kate |86 |76 |42 | 72 | 88 |
|2 | Mendoza | James |77 |92 |52 | 60 | 80 |
|3 | Park | Jay |75 |85 |85 | 92 | 95 |
|3 | Delcourt | Emma |96 |90 |95 | 81 | 72 |
|4 | Thompson | Sarah |91 |81 |92 | 81 | 73 |

# \[1. Pandas Series\]

`Series` is a one-dimensional data structure in pandas and resembles one-dimensional Numpy arrays in many aspects. 

## 1. Creating a Pandas Series
+ When it comes to creating a Series, there are many options available. One of the simplest ways is to use other existing Python containers.

Since pandas is not part of the Python standard library, you need to import it into the current working file before using it. You can import it under its full name `pandas` or under an alias.
By convention, the alias `pd` is preferred.

In [2]:
import pandas as pd
import numpy as np

### (1) Pandas Series from a list

Let's create a list for students' performance in the history exam.

In [3]:
score_list = [80,91,86,77,75,96,91]

`pd.Series()` is the pandas constructor that converts the input data into a pandas Series.

In [4]:
language_score = pd.Series(score_list)
language_score

0    80
1    91
2    86
3    77
4    75
5    96
6    91
dtype: int64

### (2) Pandas Series from a Numpy array.

`pd.Series()` can also convert a one-dimensional NumPy array into a pandas Series.

In [5]:
values = np.array([80,91,86,77, 75,96,91])
language_score = pd.Series(values)
language_score

0    80
1    91
2    86
3    77
4    75
5    96
6    91
dtype: int32
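The `int32` shown above, compared with `int64` for the list, most likely reflects NumPy's platform-dependent default integer type (for example, on Windows with NumPy versions before 2.0). As a minimal sketch, assuming you want the same result everywhere, you can request the dtype explicitly when building the array:

```python
import numpy as np
import pandas as pd

# Ask NumPy for 64-bit integers instead of relying on the platform default.
values = np.array([80, 91, 86, 77, 75, 96, 91], dtype=np.int64)
language_score = pd.Series(values)
print(language_score.dtype)  # int64, because it was requested explicitly

# A Series can also be converted after it has been created.
as_float = language_score.astype("float64")
print(as_float.dtype)  # float64
```

The Series keeps whatever dtype the array already has, so fixing the dtype at the NumPy level is usually enough.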

### (3) Pandas Series from a Dictionary
When a Python dictionary is passed to `pd.Series()`, pandas uses the dictionary keys as Index values and the dictionary values as the elements of the Series.

In [6]:
score_dict = {
    "John":80,
    "Elise":91,
    "Kate":86,
    "James":77,
    "Jay":75,
    "Emma":96,
    "Sarah":91
}
language_score = pd.Series(score_dict)
language_score

John     80
Elise    91
Kate     86
James    77
Jay      75
Emma     96
Sarah    91
dtype: int64
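The key-to-index mapping described above also means you can combine a dictionary with the `index` keyword. The sketch below is only an illustration: `"Maria"` is a hypothetical student who is not in the dictionary, added to show what happens to a missing key.

```python
import pandas as pd

score_dict = {"John": 80, "Elise": 91, "Kate": 86}

# With a dict, `index` selects and orders the entries by key.
# "Maria" (hypothetical) has no entry in the dict, so her value is NaN.
selected = pd.Series(score_dict, index=["Kate", "John", "Maria"])
print(selected)
```

Because `NaN` is a floating-point value, the resulting Series has dtype `float64` rather than `int64`.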

## 2. Attributes of Pandas Series
A basic pandas Series is composed of 

- `values` - elements
- `index` -  label of an element 
- `name` - name of the Series

Let's take a closer look at each attribute.

### (1) `values` in a Series

In a Series, the values are **the data stored in each element**.
To get them, use `.values`, which returns the data as a NumPy array for ordinary NumPy-backed data such as the integers used here.

In [7]:
language_score.values

array([80, 91, 86, 77, 75, 96, 91], dtype=int64)
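Besides the `.values` attribute, pandas also provides the `.to_numpy()` method, which the pandas documentation recommends when you specifically need a NumPy array. A short sketch comparing the two on the same scores:

```python
import numpy as np
import pandas as pd

language_score = pd.Series([80, 91, 86, 77, 75, 96, 91])

arr_values = language_score.values     # attribute access
arr_numpy = language_score.to_numpy()  # method call, same data for this Series

print(type(arr_values))
print(np.array_equal(arr_values, arr_numpy))  # True
```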

If you only want to check the data type of the given data, use `.dtype`.

In [7]:
language_score.dtype

dtype('int64')

If you want to generate summary statistics (descriptive statistics) for a Series, use `.describe()`.

For numeric data, `.describe()` shows the following statistics.
- `count` (number of non-NA/null observations)
- `mean` (mean of the values)
- `std` (standard deviation of the observations)
- `min` (minimum of the values in the Series)
- `25%` (1st quartile)
- `50%` (2nd quartile, the median)
- `75%` (3rd quartile)
- `max` (maximum of the values in the Series)

In [8]:
language_score.describe()

count     7.000000
mean     85.142857
std       7.988086
min      75.000000
25%      78.500000
50%      86.000000
75%      91.000000
max      96.000000
dtype: float64

For object data such as strings and timestamps, `.describe()` reports the following instead.

- `count` (number of non-NA/null observations)
- `unique` (number of unique values in the Series)
- `top` (the most common value in the Series)
- `freq` (frequency of the most common value)

In [9]:
pd.Series(["John","Elise","Kate","James","Jay","Emma","Sarah"]).describe()

count        7
unique       7
top       John
freq         1
dtype: object

### (2) `Index` of pandas Series

Index is the label of each element in a Series. The Index values need not be unique but must be hashable, and the index must have the same length as the data. That is, if a Series consists of 8 elements, there must be 8 index values as well.<br> By default, the index is a RangeIndex running from 0 to n-1, where n is the number of elements. However, if you have particular index values in mind, you can supply your own by passing them to the `index` keyword of `pd.Series()`.

In [10]:
language_score = pd.Series([80,91,86,77,75,96,91],
                    index=["John","Elise","Kate","James", "Jay","Emma","Sarah"])
language_score

John     80
Elise    91
Kate     86
James    77
Jay      75
Emma     96
Sarah    91
dtype: int64
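The two rules stated earlier (labels may repeat, but the number of labels must match the number of elements) can be checked with a small sketch. The duplicated `"Kate"` label is hypothetical data used only for illustration.

```python
import pandas as pd

# Hypothetical data: "Kate" appears twice to show a non-unique index.
scores = pd.Series([86, 79, 80], index=["Kate", "Kate", "John"])

print(scores.index.is_unique)  # False
kate_scores = scores["Kate"]   # a duplicated label returns a Series of all matches
john_score = scores["John"]    # a unique label returns a single value

# The number of labels must match the number of elements.
try:
    pd.Series([86, 79, 80], index=["Kate", "John"])
    length_error = None
except ValueError as exc:
    length_error = exc
print(type(length_error).__name__)  # ValueError
```

Duplicate labels are allowed, but they change what a lookup returns, so unique labels are usually easier to work with.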

If you want to check the index values of a Series, use `.index`.

In [11]:
language_score.index

Index(['John', 'Elise', 'Kate', 'James', 'Jay', 'Emma', 'Sarah'], dtype='object')

To access an element of a Series, place its index label between square brackets, much like indexing a NumPy array. Suppose you want to see John's language score. His name, `John`, is the index label for his score, so the expression is `language_score["John"]`.

In [12]:
language_score["John"]

80

Alternatively, you can use the <code>dot operator(**.**)</code>, placed between the name of the Series and the index label you want to access. Here, a dot between `language_score` and `John` gives you John's score in `language_score`. This works only when the label is a valid Python identifier and does not clash with an existing Series attribute or method.

In [13]:
language_score.John

80
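Dot access is convenient, but it falls back on ordinary Python attribute lookup, so it has limits. The sketch below uses labels picked purely for illustration: `"mean"` collides with the `Series.mean` method, and `"Mary Ann"` is not a valid Python identifier.

```python
import pandas as pd

# Illustrative labels: "mean" clashes with a method name, "Mary Ann" has a space.
scores = pd.Series([80, 91, 77], index=["John", "mean", "Mary Ann"])

print(scores.John)            # 80, same as scores["John"]
print(callable(scores.mean))  # True: this is the Series.mean method, not the value 91
print(scores["mean"])         # 91
print(scores["Mary Ann"])     # 77; `scores.Mary Ann` would be a syntax error
```

Square brackets work for every label, which is why they are the safer default.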

### (3) `name` of a Series

You can give a Series a name. This is especially helpful when you have to deal with multiple Series objects at the same time. <br>
Note that the name of a Series is not the same as the name of the variable that refers to it.

In [8]:
language_score = pd.Series([80,91,86,77,75,96,91],
                    index=["John","Elise","Kate","James","Jay","Emma","Sarah"],
                    name="History")
language_score 

# Here, the variable of the Series object is named `language_score` while the Series itself is named `History`

John     80
Elise    91
Kate     86
James    77
Jay      75
Emma     96
Sarah    91
Name: History, dtype: int64
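One place where the Series name matters is when several Series are combined: each name becomes a column label. The following sketch uses the first three students' History and English scores from the report card; the name `"History (midterm)"` is a hypothetical label used only to demonstrate renaming.

```python
import pandas as pd

students = ["John", "Elise", "Kate"]
history = pd.Series([80, 91, 86], index=students, name="History")
english = pd.Series([92, 75, 76], index=students, name="English")

# Each Series name becomes a column label when the Series are placed side by side.
report = pd.concat([history, english], axis=1)
print(report.columns.tolist())  # ['History', 'English']

# rename() with a scalar returns a new Series with a different name;
# the original Series keeps its name.
renamed = history.rename("History (midterm)")
print(history.name, "|", renamed.name)
```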

----

# \[2. Pandas DataFrame\]
While a pandas Series holds one-dimensional data, a pandas DataFrame holds two-dimensional data, like a 2-dimensional NumPy array. Unlike a NumPy array, however, a DataFrame can be heterogeneous: it can be viewed as **a collection of multiple Series objects of different data types** that share the same index.
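To make the "collection of Series" idea concrete, here is a small sketch. The `Average` column is not part of the report card; it is computed here from the five subject scores of each student purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "First Name": ["John", "Elise", "Kate"],  # text
    "History": [80, 91, 86],                  # integers
    "Average": [79.8, 81.8, 72.8],            # floats (illustrative, derived column)
})

# Every column has its own dtype...
print(df.dtypes)

# ...and selecting a single column gives back a Series.
print(type(df["History"]))
```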


## 1. Creating a DataFrame

A pandas DataFrame is a 2-dimensional table of data.<br>As with Series, you can use lists, dicts, NumPy arrays, and other Python data structures to create DataFrame objects.

### (1) Pandas DataFrame from a list

`pd.DataFrame()` is the constructor that converts input data into a DataFrame.

You can pass a list, or a list of lists, to `pd.DataFrame()` to create a DataFrame. Each inner list becomes one row.

In [15]:
# Each inner list is one row: class, last name, first name, then five exam scores
scores = [["1","Smith","John",80,92,70,65,92],
          ["1","Schafer", "Elise",91,75,90,68,85],
          ["2","Zimmermann", "Kate",86,76,42,72,88],
          ["2","Mendoza", "James",77,92,52,60,80],
          ["3","Park", "Jay",75,85,85,92,95],
          ["3","Delcourt", "Emma",96,90,95,81,72],
          ["4","Thompson", "Sarah",91,81,92,81,73]]

df = pd.DataFrame(scores)
df

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Smith | John | 80 | 92 | 70 | 65 | 92 |
| 1 | 1 | Schafer | Elise | 91 | 75 | 90 | 68 | 85 |
| 2 | 2 | Zimmermann | Kate | 86 | 76 | 42 | 72 | 88 |
| 3 | 2 | Mendoza | James | 77 | 92 | 52 | 60 | 80 |
| 4 | 3 | Park | Jay | 75 | 85 | 85 | 92 | 95 |
| 5 | 3 | Delcourt | Emma | 96 | 90 | 95 | 81 | 72 |
| 6 | 4 | Thompson | Sarah | 91 | 81 | 92 | 81 | 73 |
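A list of lists carries no column names, which is why the columns above are labelled with integers. As a minimal sketch, you can supply labels with the `columns` keyword; the names below are taken from the report card at the top of the lesson, and only the first two rows are used to keep it short.

```python
import pandas as pd

scores = [["1", "Smith", "John", 80, 92, 70, 65, 92],
          ["1", "Schafer", "Elise", 91, 75, 90, 68, 85]]
columns = ["Class", "Last Name", "First Name", "History",
           "English", "Math", "Social Studies", "Science"]

# `columns` must contain one label per element of each inner list.
df = pd.DataFrame(scores, columns=columns)
print(df.shape)             # (2, 8)
print(df["Math"].tolist())  # [70, 90]
```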


### (2) Pandas DataFrame from a Numpy array

Likewise, `pd.DataFrame()` converts a 2-dimensional NumPy array to a DataFrame.


In [16]:
# A NumPy array holds a single dtype, so the integer scores are stored as strings here
scores = np.array([["1","Smith","John",80,92,70,65,92],
          ["1","Schafer", "Elise",91,75,90,68,85],
          ["2","Zimmermann", "Kate",86,76,42,72,88],
          ["2","Mendoza", "James",77,92,52,60,80],
          ["3","Park", "Jay",75,85,85,92,95],
          ["3","Delcourt", "Emma",96,90,95,81,72],
          ["4","Thompson", "Sarah",91,81,92,81,73]])
df = pd.DataFrame(scores)
df

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Smith | John | 80 | 92 | 70 | 65 | 92 |
| 1 | 1 | Schafer | Elise | 91 | 75 | 90 | 68 | 85 |
| 2 | 2 | Zimmermann | Kate | 86 | 76 | 42 | 72 | 88 |
| 3 | 2 | Mendoza | James | 77 | 92 | 52 | 60 | 80 |
| 4 | 3 | Park | Jay | 75 | 85 | 85 | 92 | 95 |
| 5 | 3 | Delcourt | Emma | 96 | 90 | 95 | 81 | 72 |
| 6 | 4 | Thompson | Sarah | 91 | 81 | 92 | 81 | 73 |
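One thing to watch for when starting from a NumPy array is that an array has a single dtype, so mixing names and scores turns every element into a string. The sketch below, on a reduced two-row example, shows one possible way to turn the score columns back into integers with `astype`.

```python
import numpy as np
import pandas as pd

scores = np.array([["Smith", 80, 92],
                   ["Schafer", 91, 75]])
print(scores.dtype.kind)  # "U": every element, including the scores, is a string

df = pd.DataFrame(scores, columns=["Last Name", "History", "English"])

# Convert the score columns back to integers before doing arithmetic.
df = df.astype({"History": int, "English": int})
print(df["History"].sum())  # 171
```

If the data is mixed to begin with, building the DataFrame from a list or a dictionary avoids this conversion step.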


### (3) Pandas DataFrame from a Dictionary

`pd.DataFrame()` converts an input dictionary to a DataFrame by using the dictionary keys as column labels and each dictionary value as the data of that column.

In [17]:
scores = {
    "A" :["1","1","2","2","3","3","4"],
    "B" :["Smith", "Schafer", "Zimmermann", "Mendoza", "Park", "Delcourt", "Thompson"],
    "C":["John","Elise","Kate","James","Jay","Emma","Sarah"],
    "D":[80,91,86,77,75,96,91],
    "E":[92,75,76,92,85,90,81],
    "F":[70,90,42,52,85,95,92],
    "G":[65,68,72,60,92,81,81],
    "H":[92,85,88,80,95,72,73]
}
df = pd.DataFrame(scores) 
df

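|   | A | B | C | D | E | F | G | H |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Smith | John | 80 | 92 | 70 | 65 | 92 |
| 1 | 1 | Schafer | Elise | 91 | 75 | 90 | 68 | 85 |
| 2 | 2 | Zimmermann | Kate | 86 | 76 | 42 | 72 | 88 |
| 3 | 2 | Mendoza | James | 77 | 92 | 52 | 60 | 80 |
| 4 | 3 | Park | Jay | 75 | 85 | 85 | 92 | 95 |
| 5 | 3 | Delcourt | Emma | 96 | 90 | 95 | 81 | 72 |
| 6 | 4 | Thompson | Sarah | 91 | 81 | 92 | 81 | 73 |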
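A dictionary of lists describes the table column by column. Another common layout, shown here as a sketch, is a list of dictionaries that describes it row by row; in that case each dictionary is one row and its keys become the column labels. Both layouts should produce the same DataFrame.

```python
import pandas as pd

# Column-wise: one key per column.
by_column = {"First Name": ["John", "Elise"], "History": [80, 91]}

# Row-wise: one dictionary per row.
by_row = [{"First Name": "John", "History": 80},
          {"First Name": "Elise", "History": 91}]

df_cols = pd.DataFrame(by_column)
df_rows = pd.DataFrame(by_row)
print(df_cols.equals(df_rows))  # True
```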


## 2. Attributes of Pandas DataFrame

A DataFrame consists of 
- `values` - elements
- `index` and `columns` - labels in each dimension

Let's take a look at each of these attributes.


### (1) `values` in a DataFrame

`.values` returns all values in the DataFrame as a 2-dimensional NumPy array.


In [18]:
df.values

array([['1', 'Smith', 'John', 80, 92, 70, 65, 92],
       ['1', 'Schafer', 'Elise', 91, 75, 90, 68, 85],
       ['2', 'Zimmermann', 'Kate', 86, 76, 42, 72, 88],
       ['2', 'Mendoza', 'James', 77, 92, 52, 60, 80],
       ['3', 'Park', 'Jay', 75, 85, 85, 92, 95],
       ['3', 'Delcourt', 'Emma', 96, 90, 95, 81, 72],
       ['4', 'Thompson', 'Sarah', 91, 81, 92, 81, 73]], dtype=object)
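The `dtype=object` in the output above appears because the DataFrame mixes text and integer columns, and a single NumPy array has to hold them all. A short sketch, on a reduced version of the data, of how the result changes when only numeric columns are selected:

```python
import pandas as pd

df = pd.DataFrame({
    "C": ["John", "Elise"],
    "D": [80, 91],
    "E": [92, 75],
})

mixed = df.values                # text and integers together -> dtype object
numeric = df[["D", "E"]].values  # integers only -> an integer dtype

print(mixed.dtype)
print(numeric.dtype)
```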

To get general information about a DataFrame, use `.info()`, which prints a concise summary. The summary includes the column names, the data type of each column, the number of non-null values, and the memory usage.

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   A       7 non-null      object
 1   B       7 non-null      object
 2   C       7 non-null      object
 3   D       7 non-null      int64 
 4   E       7 non-null      int64 
 5   F       7 non-null      int64 
 6   G       7 non-null      int64 
 7   H       7 non-null      int64 
dtypes: int64(5), object(3)
memory usage: 576.0+ bytes


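For the statistical summary of a DataFrame, `.describe()` returns descriptive statistics on your data. Since columns `A`, `B`, and `C` (class, last name, and first name) hold strings, they are left out of the summary by default.

In [20]:
df.describe()

|       | D | E | F | G | H |
|-------|---|---|---|---|---|
| count | 7.000000 | 7.000000 | 7.000000 | 7.000000 | 7.000000 |
| mean  | 85.142857 | 84.428571 | 75.142857 | 74.142857 | 83.571429 |
| std   | 7.988086 | 7.276839 | 21.043040 | 11.096975 | 8.960230 |
| min   | 75.000000 | 75.000000 | 42.000000 | 60.000000 | 72.000000 |
| 25%   | 78.500000 | 78.500000 | 61.000000 | 66.500000 | 76.500000 |
| 50%   | 86.000000 | 85.000000 | 85.000000 | 72.000000 | 85.000000 |
| 75%   | 91.000000 | 91.000000 | 91.000000 | 81.000000 | 90.000000 |
| max   | 96.000000 | 92.000000 | 95.000000 | 92.000000 | 95.000000 |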
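By default `.describe()` on a DataFrame covers only the numeric columns. If you also want the string columns summarized, one option is the `include="all"` argument, sketched here on a reduced two-column version of the data.

```python
import pandas as pd

df = pd.DataFrame({
    "B": ["Smith", "Schafer"],
    "D": [80, 91],
})

numeric_only = df.describe()           # numeric columns only
everything = df.describe(include="all")  # numeric and string columns together

print(numeric_only.columns.tolist())  # ['D']
print(everything.index.tolist())
```

In the combined summary, statistics that do not apply to a column (for example `mean` for names) are filled with `NaN`.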

### (2) Index for rows and columns of a DataFrame

A DataFrame is a 2-dimensional object with rows and columns. The row indices can be fetched by `.index` and column indices by `.columns`.

- Row indices = `.index`
- Column indices = `.columns` 

The same attributes can also be assigned to in order to rename the corresponding labels.

#### Row index / Row names

In [21]:
df.index

RangeIndex(start=0, stop=7, step=1)

Rename row indices as follows.

In [22]:
row_index = ["row1","row2","row3","row4","row5","row6","row7"]
df.index = row_index
df.index

Index(['row1', 'row2', 'row3', 'row4', 'row5', 'row6', 'row7'], dtype='object')

#### Column index / Column names

In [23]:
df.columns # getting column names

Index(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'], dtype='object')

Rename column indices as follows.

In [24]:
col_index = ["col1","col2","col3","col4","col5","col6","col7", "col8"]
df.columns = col_index
df.columns

Index(['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8'], dtype='object')

In [25]:
df

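|      | col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 |
|------|------|------|------|------|------|------|------|------|
| row1 | 1 | Smith | John | 80 | 92 | 70 | 65 | 92 |
| row2 | 1 | Schafer | Elise | 91 | 75 | 90 | 68 | 85 |
| row3 | 2 | Zimmermann | Kate | 86 | 76 | 42 | 72 | 88 |
| row4 | 2 | Mendoza | James | 77 | 92 | 52 | 60 | 80 |
| row5 | 3 | Park | Jay | 75 | 85 | 85 | 92 | 95 |
| row6 | 3 | Delcourt | Emma | 96 | 90 | 95 | 81 | 72 |
| row7 | 4 | Thompson | Sarah | 91 | 81 | 92 | 81 | 73 |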
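Assigning to `.index` and `.columns` replaces all labels at once and modifies the DataFrame in place. If you only want to change some labels, or prefer to keep the original untouched, the `.rename()` method is one alternative. The sketch below uses a reduced two-row DataFrame, with new names chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": ["1", "1"], "D": [80, 91]})

# rename() takes old-to-new mappings and returns a new DataFrame.
renamed = df.rename(
    index={0: "row1", 1: "row2"},
    columns={"A": "Class", "D": "History"},
)

print(renamed.index.tolist())    # ['row1', 'row2']
print(renamed.columns.tolist())  # ['Class', 'History']
print(df.columns.tolist())       # ['A', 'D'], the original is unchanged
```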
