<center><b>© Content is made available under the CC-BY-NC-ND 4.0 license. Christian Lopez, lopezbec@lafayette.edu</b></center>

![alt text](https://miro.medium.com/v2/resize:fit:770/1*pJnfAWcDbz7qnQr7at3jkw.png)


Most of the notebooks we are going to be using are inspired by existing notebooks that are available online and are made free for educational purposes. The work of Jake VanderPlas and others served as a guide and inspiration for these notebooks [see his Python Data Science Handbook](https://github.com/jakevdp/PythonDataScienceHandbook/tree/master). Nonetheless, these notebooks should not be shared without prior permission of the instructor. When working on an assignment, always remember the Student Code of Conduct.



### **Instructions:**
- You will be using Python.

- Only modify the code that is within the comments:

`### START CODE HERE ###`

`### END CODE HERE ###`

- You need to run all the code cells in the notebook sequentially.
- If you are asked to change/update a cell, change/update it and run it to check whether your result is correct.




# Introduction to Pandas



Previously, we dove into detail on NumPy and its `ndarray` object, which enables efficient storage and manipulation of dense typed arrays in Python.
Here we'll build on this knowledge by looking in depth at the data structures provided by the Pandas library. Pandas is a newer package built on top of NumPy that provides an efficient implementation of a `DataFrame`.

``DataFrame``s are essentially multidimensional arrays with attached row and column labels, often with heterogeneous types and/or missing data.
As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.

As we've seen, NumPy's `ndarray` data structure provides essential features for the type of clean, well-organized data typically seen in numerical computing tasks.
While it serves this purpose very well, its limitations become clear when we need more flexibility (e.g., attaching labels to data, working with missing data, etc.) and when attempting operations that do not map well to element-wise broadcasting (e.g., groupings, pivots, etc.), each of which is an important piece of analyzing the less structured data available in many forms in the world around us.

Pandas, and in particular its `Series` and `DataFrame` objects, builds on the NumPy array structure and provides efficient access to these sorts of "data munging" tasks that occupy much of a data scientist's time.

In this part of the book, we will focus on the mechanics of using `Series`, `DataFrame`, and related structures effectively.
We will use examples drawn from real datasets where appropriate, but these examples are not necessarily the focus.

## Installing and Using Pandas

Installation of Pandas on your system requires NumPy to be installed, and if you're building the library from source, you will need the appropriate tools to compile the C and Cython sources on which Pandas is built.
Details on the installation process can be found in the [Pandas documentation](http://pandas.pydata.org/).
If you followed the advice outlined in the [Preface](00.00-Preface.ipynb) and used the Anaconda stack, you already have Pandas installed.

Once Pandas is installed, you can import it and check the version; here is the version used by this book:

In [1]:
import pandas
pandas.__version__

'2.2.2'

Just as we generally import NumPy under the alias `np`, we will import Pandas under the alias `pd`:

In [2]:
import numpy as np
import pandas as pd

# Introducing Pandas Objects

At a very basic level, Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices.
As we will see during the course of this chapter, Pandas provides a host of useful tools, methods, and functionality on top of the basic data structures, but nearly everything that follows will require an understanding of what these structures are.
Thus, before we go any further, let's take a look at these three fundamental Pandas data structures: the `Series`, `DataFrame`, and `Index`.


## The Pandas Series Object

A Pandas `Series` is a one-dimensional array of indexed data.
It can be created from a list or array as follows:

In [3]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64


The `Series` combines a sequence of values with an explicit sequence of indices, which we can access with the `values` and `index` attributes.
The `values` are simply a familiar NumPy array:

In [4]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])




The `index` is an array-like object of type `pd.Index`, which we'll discuss in more detail momentarily:

In [5]:
data.index

RangeIndex(start=0, stop=4, step=1)

Like with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation:

In [6]:
data[1]

0.5

#### **Exercise:**


Now update `p1` so that it contains only the second and third values of the original Pandas `Series` `data` (remember that you will need Python slicing with `:`).



In [7]:
p1=data
### START CODE HERE ### (≈ 1 line of code)
p1=data[1:3]
### END CODE HERE ###

In [8]:
try:
    print("p1=")
    print(p1)
    print("Shape="+ str(p1.shape))
except Exception as e:
    print("Something is not working right")
    print(e)

p1=
1    0.50
2    0.75
dtype: float64
Shape=(2,)


Expected output:
```
p1=
1    0.50
2    0.75
dtype: float64
Shape=(2,)
```

As we will see, though, the Pandas `Series` is much more general and flexible than the one-dimensional NumPy array that it emulates.

### Series as Generalized NumPy Array

From what we've seen so far, the `Series` object may appear to be basically interchangeable with a one-dimensional NumPy array.
The essential difference is that while the NumPy array has an *implicitly defined* integer index used to access the values, the Pandas `Series` has an *explicitly defined* index associated with the values.

This explicit index definition gives the `Series` object additional capabilities. For example, the index need not be an integer, but can consist of values of any desired type.
So, if we wish, we can use strings as an index:

In [9]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64


And the item access works as expected:

In [10]:
data['b']

0.5

We can even use noncontiguous or nonsequential indices:

In [11]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[2, 5, 3, 7])
data

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64


In [12]:
data[5]

0.5
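Note that with an integer index like this one, square brackets look up the *label* 5, not the element at position 5. As a small sketch of the distinction, the standard `.loc` (label-based) and `.iloc` (position-based) accessors make the intent explicit; the variable name `s` is used only for this illustration:

```python
import pandas as pd

s = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

by_label = s.loc[5]       # look up the label 5 (same result as s[5] here)
by_position = s.iloc[1]   # look up the second element, regardless of its label
third = s.iloc[2]         # third element, whose label happens to be 3

print(by_label, by_position, third)
```

Here `s.loc[5]` and `s.iloc[1]` refer to the same element, while `s.iloc[2]` and `s.loc[3]` both refer to the value 0.75.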

### Series as Specialized Dictionary

In this way, you can think of a Pandas `Series` a bit like a specialization of a Python dictionary.
A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a `Series` is a structure that maps typed keys to a set of typed values.
This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas `Series` makes it more efficient than Python dictionaries for certain operations.
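
One way to see this typing is to inspect the `dtype` of the values and of the index. The following is a minimal sketch with made-up data (`typed` is a hypothetical name, not part of this notebook):

```python
import pandas as pd

typed = pd.Series({'a': 1, 'b': 2, 'c': 3})

print(typed.dtype)         # dtype shared by all values (an integer type)
print(typed.index.dtype)   # dtype of the keys (string labels are stored as object in pandas 2.x)
```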

The `Series`-as-dictionary analogy can be made even more clear by constructing a `Series` object directly from a Python dictionary, here the five most populous US states according to the 2020 census:

In [13]:
population_dict = {'California': 39538223, 'Texas': 29145505,
                   'Florida': 21538187, 'New York': 20201249,
                   'Pennsylvania': 13002700}
population = pd.Series(population_dict)
population

California      39538223
Texas           29145505
Florida         21538187
New York        20201249
Pennsylvania    13002700
dtype: int64


From here, typical dictionary-style item access can be performed:

In [14]:
population['California']

39538223

Unlike a dictionary, though, the `Series` also supports array-style operations such as slicing:

In [15]:
population['California':'Florida']

California    39538223
Texas         29145505
Florida       21538187
dtype: int64


We'll discuss some of the quirks of Pandas indexing and slicing in [Data Indexing and Selection](03.02-Data-Indexing-and-Selection.ipynb).
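
One such quirk is already visible in the slice above: when slicing by label, the end label is *included*, whereas positional slicing follows the usual Python convention of excluding the stop position. A short sketch, which rebuilds the `population` Series so that it runs on its own:

```python
import pandas as pd

population = pd.Series({'California': 39538223, 'Texas': 29145505,
                        'Florida': 21538187, 'New York': 20201249,
                        'Pennsylvania': 13002700})

by_label = population['California':'Florida']   # label slice: 'Florida' is included
by_position = population.iloc[0:2]              # positional slice: position 2 is excluded

print(len(by_label), len(by_position))
```

The label slice returns three entries, while the positional slice `0:2` returns only two.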

## The Pandas DataFrame Object

The next fundamental structure in Pandas is the `DataFrame`.
Like the `Series` object discussed in the previous section, the `DataFrame` can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary.
We'll now take a look at each of these perspectives.

### DataFrame as Generalized NumPy Array
If a `Series` is an analog of a one-dimensional array with explicit indices, a `DataFrame` is an analog of a two-dimensional array with explicit row and column indices.
Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a `DataFrame` as a sequence of aligned `Series` objects.
Here, by "aligned" we mean that they share the same index.
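
To make the idea of alignment concrete, here is a minimal sketch using two small, made-up `Series` whose indices are ordered differently and only partly overlap (the names `left`, `right`, and `frame` are hypothetical). When they are combined, Pandas lines the values up by label and fills the combinations that do not exist with `NaN`:

```python
import pandas as pd

left = pd.Series({'a': 1, 'b': 2, 'c': 3})
right = pd.Series({'c': 30, 'a': 10, 'd': 40})

# Values are matched by index label, not by position
frame = pd.DataFrame({'left': left, 'right': right})
print(frame)
```

The resulting `DataFrame` has one row per label found in either `Series`; for example, row `'b'` has no value in the `right` column.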

To demonstrate this, let's first construct a new `Series` listing the area of each of the five states discussed in the previous section (in square kilometers):

In [16]:
area_dict = {'California': 423967, 'Texas': 695662, 'Florida': 170312,
             'New York': 141297, 'Pennsylvania': 119280}
area = pd.Series(area_dict)
area

California      423967
Texas           695662
Florida         170312
New York        141297
Pennsylvania    119280
dtype: int64


Now that we have this along with the `population` Series from before, we can use a dictionary to construct a single two-dimensional object containing this information:

In [17]:
states = pd.DataFrame({'population': population,
                       'area': area})
states

              population    area
California      39538223  423967
Texas           29145505  695662
Florida         21538187  170312
New York        20201249  141297
Pennsylvania    13002700  119280


Like the `Series` object, the `DataFrame` has an `index` attribute that gives access to the index labels:

In [18]:
states.index

Index(['California', 'Texas', 'Florida', 'New York', 'Pennsylvania'], dtype='object')

Additionally, the `DataFrame` has a `columns` attribute, which is an `Index` object holding the column labels:

In [19]:
states.columns

Index(['population', 'area'], dtype='object')

Thus the `DataFrame` can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and columns have a generalized index for accessing the data.

### DataFrame as Specialized Dictionary

Similarly, we can also think of a `DataFrame` as a specialization of a dictionary.
Where a dictionary maps a key to a value, a `DataFrame` maps a column name to a `Series` of column data.
For example, asking for the `'area'` key returns the `Series` object containing the areas we saw earlier:

In [20]:
states['area']

California      423967
Texas           695662
Florida         170312
New York        141297
Pennsylvania    119280
Name: area, dtype: int64


Notice the potential point of confusion here: in a two-dimensional NumPy array, `data[0]` will return the first *row*. For a `DataFrame`, `data['col0']` will return the first *column*.
Because of this, it is probably better to think about ``DataFrame``s as generalized dictionaries rather than generalized arrays, though both ways of looking at the situation can be useful.
We'll explore more flexible means of indexing ``DataFrame``s in [Data Indexing and Selection](03.02-Data-Indexing-and-Selection.ipynb).
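
The row-versus-column difference described above can be checked with a tiny example. This is only a sketch; the names `arr`, `df`, `col0`, and `col1` are made up for the illustration:

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2], [3, 4]])
df = pd.DataFrame(arr, columns=['col0', 'col1'])

first_row = arr[0]        # NumPy: indexing with 0 gives the first row (1, 2)
first_col = df['col0']    # DataFrame: indexing with a column name gives that column (1, 3)

print('col0' in df)       # membership tests check column names, as with dictionary keys
print(list(df.keys()))    # keys() returns the column labels
```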

### Constructing DataFrame Objects

A Pandas `DataFrame` can be constructed in a variety of ways.
Here we'll explore several examples.

#### From a single Series object

A `DataFrame` is a collection of `Series` objects, and a single-column `DataFrame` can be constructed from a single `Series`:

In [21]:
pd.DataFrame(population, columns=['population'])

              population
California      39538223
Texas           29145505
Florida         21538187
New York        20201249
Pennsylvania    13002700


#### From a list of dicts

Any list of dictionaries can be made into a `DataFrame`.
We'll use a simple list comprehension to create some data:

In [22]:
data = [{'a': i, 'b': 2 * i}
        for i in range(3)]
pd.DataFrame(data)

   a  b
0  0  0
1  1  2
2  2  4


Even if some keys in the dictionary are missing, Pandas will fill them in with `NaN` values (i.e., "Not a Number"; see [Handling Missing Data](03.04-Missing-Values.ipynb)):

In [23]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0
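
A side effect worth noticing in the output above is that columns `a` and `c` are displayed as floats. `NaN` is a floating-point value, so by default a column of integers with a missing entry is stored as a float column. A small sketch to check this (the variable name `partial` is hypothetical):

```python
import pandas as pd

partial = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

print(partial.dtypes)         # a and c become float columns; b keeps an integer type
print(partial.isna().sum())   # number of missing entries in each column
```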


#### From a dictionary of Series objects

As we saw before, a `DataFrame` can be constructed from a dictionary of `Series` objects as well:

In [24]:
pd.DataFrame({'population': population,
              'area': area})

              population    area
California      39538223  423967
Texas           29145505  695662
Florida         21538187  170312
New York        20201249  141297
Pennsylvania    13002700  119280


#### From a two-dimensional NumPy array

Given a two-dimensional array of data, we can create a `DataFrame` with any specified column and index names.
If omitted, an integer index will be used for each:

In [25]:
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

        foo       bar
a  0.524313  0.487685
b  0.911020  0.721171
c  0.014258  0.756334
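
For comparison, here is a sketch of the "omitted" case mentioned above, where neither `columns` nor `index` is passed (the name `default_df` is made up for this example). Both the rows and the columns then receive a default integer index:

```python
import numpy as np
import pandas as pd

default_df = pd.DataFrame(np.zeros((3, 2)))

print(default_df.index)     # RangeIndex(start=0, stop=3, step=1)
print(default_df.columns)   # RangeIndex(start=0, stop=2, step=1)
```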


#### From a NumPy structured array

We covered structured arrays in [Structured Data: NumPy's Structured Arrays](02.09-Structured-Data-NumPy.ipynb).
A Pandas `DataFrame` operates much like a structured array, and can be created directly from one:

In [26]:
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
A

array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])

In [27]:
pd.DataFrame(A)

   A    B
0  0  0.0
1  0  0.0
2  0  0.0


## The Pandas Index Object

As you've seen, the `Series` and `DataFrame` objects both contain an explicit *index* that lets you reference and modify data.
This `Index` object is an interesting structure in itself, and it can be thought of either as an *immutable array* or as an *ordered set* (technically a multiset, as `Index` objects may contain repeated values).
Those views have some interesting consequences in terms of the operations available on `Index` objects.
As a simple example, let's construct an `Index` from a list of integers:

In [28]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

Index([2, 3, 5, 7, 11], dtype='int64')
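
The remark above that an `Index` may contain repeated values can be checked directly. The following is a minimal sketch with made-up labels (`dup` and `s` are hypothetical names):

```python
import pandas as pd

dup = pd.Index(['a', 'b', 'a'])
print(dup.is_unique)    # False: the label 'a' appears twice

s = pd.Series([1, 2, 3], index=dup)
print(s['a'])           # selecting a repeated label returns every matching entry
```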

### Index as Immutable Array

The `Index` in many ways operates like an array.
For example, we can use standard Python indexing notation to retrieve values or slices:

In [29]:
ind[1]

3

In [30]:
ind[::2]

Index([2, 5, 11], dtype='int64')

`Index` objects also have many of the attributes familiar from NumPy arrays:

In [31]:
print(ind.size, ind.shape, ind.ndim, ind.dtype)

5 (5,) 1 int64


One difference between `Index` objects and NumPy arrays is that the indices are immutable—that is, they cannot be modified via the normal means:

In [32]:
#IF YOU RUN THE CODE BELOW YOU WILL GET AN ERROR
# ind[1] = 0

This immutability makes it safer to share indices between multiple ``DataFrame``s and arrays, without the potential for side effects from inadvertent index modification.
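
To see the error from the commented-out assignment without interrupting the notebook, the assignment can be wrapped in `try`/`except`. This is only a sketch; in current Pandas versions the exception raised is a `TypeError`, and the flag `modified` is a name introduced here for illustration:

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

try:
    ind[1] = 0              # Index objects do not support item assignment
    modified = True
except TypeError as err:
    modified = False
    print("TypeError:", err)

print(ind)                  # the Index is unchanged
```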

### Index as Ordered Set

Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic.
The `Index` object follows many of the conventions used by Python's built-in `set` data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way:

In [33]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [34]:
indA.intersection(indB)

Index([3, 5, 7], dtype='int64')

In [35]:
indA.union(indB)

Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [36]:
indA.symmetric_difference(indB)

Index([1, 2, 9, 11], dtype='int64')
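
The three operations above are symmetric in their arguments. The `Index.difference` method is not: the result depends on which `Index` comes first. A short sketch using the same two indices (the names `only_in_A` and `only_in_B` are made up for this example):

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

only_in_A = indA.difference(indB)   # elements of indA that are not in indB
only_in_B = indB.difference(indA)   # elements of indB that are not in indA

print(only_in_A)
print(only_in_B)
```

Taken together, these two results contain exactly the elements returned by `symmetric_difference`.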

#### **Exercise:**


Create a Pandas `Series` from the list `[3.14, 2.71, 1.41, 1.73, 0.58]`, using the index values `['a', 'b', 'c', 'd', 'e']`, and assign it to a variable named `s1`.



In [37]:
s1 = data
### START CODE HERE ###
s1 = pd.Series([3.14, 2.71, 1.41, 1.73, 0.58], index=['a', 'b', 'c', 'd', 'e'])
### END CODE HERE ###

In [38]:
try:
    print("s1=")
    print(s1)
    print("Shape="+ str(s1.shape))
except Exception as e:
    print("Something is not working right")
    print(e)

s1=
a    3.14
b    2.71
c    1.41
d    1.73
e    0.58
dtype: float64
Shape=(5,)


Expected output:
```
s1=
a    3.14
b    2.71
c    1.41
d    1.73
e    0.58
dtype: float64
Shape=(5,)
```

Convert the Series to a DataFrame and manipulate it:

Convert `s1` into a DataFrame named `df1` with a single column named `'Values'`.

Add a new column named 'Squared', which contains the square of each value in 'Values'.


In [39]:
### START CODE HERE ###
df1 = pd.DataFrame(s1, columns=['Values'])
df1['Squared'] = df1['Values'] ** 2
### END CODE HERE ###

In [40]:
try:
    print("df1=")
    print(df1)
    print("Shape="+ str(df1.shape))
except Exception as e:
    print("Something is not working right")
    print(e)

df1=
   Values  Squared
a    3.14   9.8596
b    2.71   7.3441
c    1.41   1.9881
d    1.73   2.9929
e    0.58   0.3364
Shape=(5, 2)


Expected output:
```
df1=
   Values  Squared
a    3.14   9.8596
b    2.71   7.3441
c    1.41   1.9881
d    1.73   2.9929
e    0.58   0.3364
Shape=(5, 2)
```
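
As an aside, the same conversion can also be written with the `Series.to_frame` method. The sketch below is an alternative and not the expected exercise answer; it uses a separate, hypothetical variable name `alt_df` so that `df1` is left untouched:

```python
import pandas as pd

s1 = pd.Series([3.14, 2.71, 1.41, 1.73, 0.58], index=['a', 'b', 'c', 'd', 'e'])

alt_df = s1.to_frame(name='Values')          # Series -> one-column DataFrame
alt_df['Squared'] = alt_df['Values'] ** 2    # element-wise square, aligned on the index

print(alt_df.shape)
```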



Extract and update a subset:

Create a new DataFrame p2 that contains only the second and third rows of df1 (use slicing).


In [41]:
### START CODE HERE ###
p2 = df1[1:3]  # Select second and third rows
### END CODE HERE ###

In [42]:
try:
    print("p2=")
    print(p2)
    print("Shape="+ str(p2.shape))
except Exception as e:
    print("Something is not working right")
    print(e)

p2=
   Values  Squared
b    2.71   7.3441
c    1.41   1.9881
Shape=(2, 2)


Expected output:
```
p2=
   Values  Squared
b    2.71   7.3441
c    1.41   1.9881
Shape=(2, 2)
```
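
Note that an integer slice inside square brackets selects *rows* of a `DataFrame` by position, even though a single label inside square brackets selects a *column*. A minimal sketch, assuming a small stand-in frame named `df` (hypothetical), showing that the slice matches the more explicit `.iloc` form:

```python
import pandas as pd

df = pd.DataFrame({'Values': [3.14, 2.71, 1.41, 1.73, 0.58]},
                  index=['a', 'b', 'c', 'd', 'e'])

rows_plain = df[1:3]        # an integer slice in [] selects rows by position
rows_iloc = df.iloc[1:3]    # the same selection, stated explicitly

print(rows_plain.equals(rows_iloc))
```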

###### **DO NOT DELETE, MODIFY, NOR USE THESE CODE CELLS**



In [43]:
# -*- coding: utf-8 -*-
#!wget https://raw.githubusercontent.com/lopezbec/intro_python_notebooks/main/Grading_PB.py
#import Grading_PB

try:
    p1
except:
    p1=None
try:
    s1
except:
    s1=None
try:
    df1
except:
    df1=None
try:
    p2
except:
    p2=None


#Grading_PB.GRADING(test,div2,mod,sqr,plus2,pwr,ex31,ex32,doub,mult,cube,ex33,ex34,ex35,ex36)

In [44]:
def GRADING(p1,s1,df1,p2):
    grades={"p1":False, "s1":False,"df1":False,
            "p2":False}
    try:
        if(p1[1]==0.5  and  p1[2]==0.75): grades["p1"]=True
    except: grades["p1"]=False
    try:
        if(s1.shape[0]==5   and s1['c']==1.41): grades["s1"]=True
    except: grades["s1"]=False
    try:
        if(df1.shape[1]==2 and df1['Squared'].iloc[1]==7.3441): grades["df1"]=True
    except: grades["df1"]=False
    try:
        if(p2.shape[1]==2 and p2['Values'].iloc[1]==1.41): grades["p2"]=True
    except: grades["p2"]=False


    for x in grades:
        print (x,':',grades[x])

In [45]:
GRADING(p1,s1,df1,p2)

p1 : True
s1 : True
df1 : True
p2 : True
