# Data Analysis in Python - I: Introduction to pandas

## References for Data Analysis in Python
1. [Pandas reference](https://pandas.pydata.org/).
2. Chen, D. Y. (2017). [Pandas for Everyone: Python Data Analysis, First Edition](https://learning.oreilly.com/library/view/pandas-for-everyone/9780134547046/?sso_link=yes&sso_link_from=pennsylvania-state-university). Addison-Wesley Professional.
3. McKinney, W. (2018). [Python for Data Analysis](https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=5061179#). O'Reilly Media, Inc.
4. Pandas Essential Training and Pandas for Data Science courses from [LinkedIn Learning](https://www.linkedin.com/learning/collections/6825816434631839745) (use Penn State credentials to sign in).
5. Data Analysis and Visualization with Python learning path at [DataQuest](http://www.dataquest.io) (create a personal account if you want to use this resource).

## Introduction

In this lesson, we will learn about pandas, a Python library designed for data analysis. We will start by creating our own datasets and then learn how to access the data in them. Specifically, we will cover the following topics:
1. Introduction to pandas
2. Installing and importing pandas
3. Series and DataFrame data types
4. Creating Series using lists
5. Creating DataFrames using lists and Series
6. Checking the size of a DataFrame

**Note:** Use the TOC to navigate between sections.


## The `pandas` library

`pandas` is a Python library (a collection of reusable code) designed to make the analysis of structured data easier. Without pandas, we would have to write our own code for functionality that pandas automates or simplifies. pandas makes it easier to read and write data and to store it in special data structures (Series and DataFrames) that are convenient to manipulate and analyze.

### Installing and importing pandas

Before we can use pandas, we must install it. There are several ways to do this; we will use the Python package manager `pip`. pandas is already installed in the online lab environment, but you will need to install it in your local environment. To do so, open a terminal and run the following command.
```
pip install pandas
```

Once `pandas` is installed, you will need to import it in any notebook that uses pandas. To import pandas, type the following code.
```
import pandas as pd
```
The `as pd` part is optional. Without it, you type the full name `pandas` to call pandas functions. With an alias such as `pd`, you type `pd` instead. For example:

```
pandas.DataFrame()
```
vs. 
```
pd.DataFrame()
```
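
As a quick check that both the installation and the alias work, you could print the installed version and call a constructor through the alias. This is only a minimal sketch: the version string depends on your environment, and the variable name `empty` is chosen here for illustration.

```python
import pandas as pd

# The version string confirms that pandas was found and imported
print(pd.__version__)

# Calling the constructor through the alias creates an empty DataFrame
empty = pd.DataFrame()
print(empty.shape)  # (0, 0): no rows and no columns
```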

In [1]:
# import the pandas library below
import pandas as pd

## pandas objects - Data Frame and Series

### Data Frame

pandas defines a data type (type of object) called a DataFrame. A DataFrame can be thought of as a spreadsheet or table: rectangular data with rows and columns. It allows users to perform various operations on those rows and columns.

### Series

pandas also has an object type called Series. It is easiest to think of a Series as a single column of a DataFrame, and of a DataFrame as a collection (or dictionary) of Series. A Series is similar to a list, but its values share a single data type (`dtype`) and each value carries an index label.
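
The point about a shared data type can be seen by inspecting the `dtype` attribute. The sketch below uses illustrative variable names (`s_ints`, `s_mixed`) that are not part of the lesson. It suggests that a list of integers gets an integer dtype, while a list mixing strings and numbers is stored under the generic `object` dtype.

```python
import pandas as pd

# A list of integers becomes a Series with an integer dtype
s_ints = pd.Series([30, 32, 29, 30])
print(s_ints.dtype)

# A list mixing strings and numbers still builds a Series,
# but pandas falls back to the generic object dtype
s_mixed = pd.Series(['Amir', 30, 2.5])
print(s_mixed.dtype)
```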

## Creating Series

One way to create a Series is from a list.

In [3]:
# import pandas if not already done
# import pandas as pd

# create an s_names series with values Amir, Biko, Chen, Darren
s_names = pd.Series(['Amir','Biko','Chen','Darren'])

# create an l_ages list with values 30, 32, 29, 30
l_ages = [30,32,29,30]

# create an s_ages series using l_ages
s_ages = pd.Series(l_ages)


# print the two series
print(s_names)
print(s_ages)

0      Amir
1      Biko
2      Chen
3    Darren
dtype: object
0    30
1    32
2    29
3    30
dtype: int64


Notice in the printed output that each item in the Series has a number to its left. This number serves as a *data label* for the item and is called an `index`. The default index consists of the integers from 0 to N-1 (where N is the number of elements in the Series), but you can assign descriptive labels instead.

In [4]:
# create s_ages2 series to store ages and assign names as the index. 

s_ages2 = pd.Series(
    [30, 32, 29, 30],
    index=['Amir', 'Biko', 'Chen', 'Darren']
)
print(s_ages2)

Amir      30
Biko      32
Chen      29
Darren    30
dtype: int64


You can print the values and the index of a Series separately.

In [7]:
# print the values and indices for the s_ages and s_ages2 series

print(s_ages.values)
print(s_ages.index)

print(s_ages2.values)
print(s_ages2.index)


[30 32 29 30]
RangeIndex(start=0, stop=4, step=1)
[30 32 29 30]
Index(['Amir', 'Biko', 'Chen', 'Darren'], dtype='object')
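
The output above hints that `values` and `index` are not plain Python lists. A minimal sketch of how you might inspect their types and convert them to lists follows; the names `ages_list` and `names_list` are illustrative.

```python
import pandas as pd

s_ages2 = pd.Series([30, 32, 29, 30],
                    index=['Amir', 'Biko', 'Chen', 'Darren'])

# .values is a NumPy array; .index is a pandas Index object
print(type(s_ages2.values))
print(type(s_ages2.index))

# Both can be converted to ordinary Python lists
ages_list = s_ages2.tolist()
names_list = s_ages2.index.tolist()
print(ages_list)
print(names_list)
```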


You can access individual elements of a Series using their index labels.

In [8]:
# retrieve the age of student Biko

print(s_ages[1])
print(s_ages2['Biko'])

32
32
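
When a Series has text labels, it can help to say explicitly whether you mean a label or a position. The sketch below uses the `.loc` (label-based) and `.iloc` (position-based) accessors, which this lesson does not otherwise cover; treat it as an optional preview rather than the lesson's method.

```python
import pandas as pd

s_ages2 = pd.Series([30, 32, 29, 30],
                    index=['Amir', 'Biko', 'Chen', 'Darren'])

# .loc selects by index label
age_by_label = s_ages2.loc['Biko']

# .iloc selects by integer position (0-based), whatever the labels are
age_by_position = s_ages2.iloc[1]

print(age_by_label, age_by_position)
```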


## Creating DataFrames

A DataFrame can be created from a dictionary of Series (or list) objects. A dictionary is a collection of key-value pairs: you specify a key to retrieve the value stored under it. A DataFrame can be thought of as a dictionary in which each column name is a key and the contents of that column are the value.

In [12]:
# Create a student DataFrame that contains names, ages and email
s_names = pd.Series(['Amir', 'Biko','Chen','Darren'])
l_ages = [30,32,29,30]

students = pd.DataFrame({'Name':s_names,'Age':l_ages,'Email':['a1@psu.edu','b2@psu.edu','c@psu.edu','5@psu.edu']})

print(students)

     Name  Age       Email
0    Amir   30  a1@psu.edu
1    Biko   32  b2@psu.edu
2    Chen   29   c@psu.edu
3  Darren   30   5@psu.edu
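
The dictionary analogy also applies when reading data back: a column can be looked up by its name, much like a dictionary key. A minimal sketch, using a smaller two-column version of the table and an illustrative variable name `ages`:

```python
import pandas as pd

students = pd.DataFrame({
    'Name': ['Amir', 'Biko', 'Chen', 'Darren'],
    'Age': [30, 32, 29, 30],
})

# Look up a column by its name, as you would a dictionary key
ages = students['Age']
print(type(ages))              # each column comes back as a Series
print(list(students.columns))  # the column names act as the keys
```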


You can specify an order for the columns using the `columns` parameter.

In [14]:
# copy code from above and specify a different column order
s_names = pd.Series(['Amir', 'Biko','Chen','Darren'])
l_ages = [30,32,29,30]

students = pd.DataFrame(
    {'Name': s_names, 'Age': l_ages,
     'Email': ['a1@psu.edu', 'b2@psu.edu', 'c@psu.edu', '5@psu.edu']},
    columns=['Name', 'Email', 'Age']
)

print(students)

     Name       Email  Age
0    Amir  a1@psu.edu   30
1    Biko  b2@psu.edu   32
2    Chen   c@psu.edu   29
3  Darren   5@psu.edu   30


You can also replace the default index with your own labels using the `index` parameter.

In [15]:
# create an l_names list and use it both as a column and as the index

l_names = ['Amir', 'Biko', 'Chen', 'Darren']
l_ages = [30, 32, 29, 30]

students = pd.DataFrame(
    {'Name': l_names, 'Age': l_ages,
     'Email': ['a1@psu.edu', 'b2@psu.edu', 'c@psu.edu', '5@psu.edu']},
    columns=['Name', 'Email', 'Age'], index=l_names
)

print(students)

          Name       Email  Age
Amir      Amir  a1@psu.edu   30
Biko      Biko  b2@psu.edu   32
Chen      Chen   c@psu.edu   29
Darren  Darren   5@psu.edu   30


**Note:** If you set an index in the `pd.DataFrame()` function, use lists instead of Series to create the DataFrame. A Series carries its own index, and pandas aligns the Series to the index you pass in; labels that do not match produce NaN values.
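
The conflict described in the note can be reproduced with a small sketch. The variable names `conflict` and `ok` are illustrative. Here the Series has the default index 0 to 3, so none of its labels match the names passed as `index`.

```python
import pandas as pd

names = ['Amir', 'Biko', 'Chen', 'Darren']
s_names = pd.Series(names)  # default index: 0, 1, 2, 3

# The Series labels (0..3) do not match the requested index (the names),
# so every value in the Name column is missing (NaN)
conflict = pd.DataFrame({'Name': s_names}, index=names)
print(conflict)

# A plain list has no index of its own, so values are placed in order
ok = pd.DataFrame({'Name': names}, index=names)
print(ok)
```

If the data already lives in a Series, converting it first with `s_names.tolist()` is one way to sidestep the alignment.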

It is also possible to specify an order for the rows, but we won't cover that at this point.

Let's look at one more way to create DataFrames from lists: using a list of lists.

In [17]:
# create lists that each represent a row and build a DataFrame from them
student_info = [
    ['Amir', 30, 'a1@psu.edu'],
    ['Biko', 32, 'b10@psu.edu'],
    ['Chen', 29, 'c2@psu.edu'],
    ['Darren', 30, 'd@psu.edu'],
]

students2 = pd.DataFrame(student_info,columns=['Name','Age','Email'])
print(students2)

     Name  Age        Email
0    Amir   30   a1@psu.edu
1    Biko   32  b10@psu.edu
2    Chen   29   c2@psu.edu
3  Darren   30    d@psu.edu


## Dimensions of a DataFrame

Often you will read data from a file into a DataFrame. When you do so, you may be interested in knowing how many rows and columns are in the DataFrame. You can find this information using the `shape` attribute of a DataFrame. 

In [18]:
# find the dimensions of the students data frame

print(students.shape)
print(type(students.shape))

(4, 3)
<class 'tuple'>


In [19]:
# number of rows

students.shape[0]

4
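
Because `shape` is a tuple, it can be unpacked into separate variables. The sketch below rebuilds the same table so that it runs on its own; `n_rows` and `n_cols` are illustrative names.

```python
import pandas as pd

students = pd.DataFrame({
    'Name': ['Amir', 'Biko', 'Chen', 'Darren'],
    'Age': [30, 32, 29, 30],
    'Email': ['a1@psu.edu', 'b2@psu.edu', 'c@psu.edu', '5@psu.edu'],
})

# Unpack the (rows, columns) tuple into two variables
n_rows, n_cols = students.shape
print(n_rows, n_cols)

# len() on a DataFrame returns the number of rows
print(len(students))
```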

**Note: Attributes (Properties) vs. Methods (Functions)**

Attributes are properties of an object. Think of them as variables whose value tells you something about the object. The syntax for using an attribute is `object.attribute`.

Methods are functions associated with an object that act on the object. This action could display some results, modify the object, and/or update the values of some attributes. The syntax for using a method is `object.method()`.
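
To make the distinction concrete, here is a small sketch on a Series. `shape` is an attribute, so it is read without parentheses; `max()` is a method, so it is called with parentheses. The `max()` method is not covered elsewhere in this lesson and `oldest` is an illustrative name.

```python
import pandas as pd

s_ages = pd.Series([30, 32, 29, 30])

# Attribute: no parentheses, it simply reads a property of the object
print(s_ages.shape)

# Method: parentheses call a function that acts on the object
oldest = s_ages.max()
print(oldest)

# Without parentheses you get the method itself, not its result
print(callable(s_ages.max))
```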


