# Introduction
Sources:
- Getting started: https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html
This page covers installing pandas, an introduction to pandas, and the user guide.
- Python for Data Analysis by Wes McKinney (2nd edition used here) - Chapter 5

Pandas is a Python library that facilitates data analysis.


In [28]:
import pandas as pd

In [29]:
type(pd)

module

In [30]:
dir(pd)

['BooleanDtype',
 'Categorical',
 'CategoricalDtype',
 'CategoricalIndex',
 'DataFrame',
 'DateOffset',
 'DatetimeIndex',
 'DatetimeTZDtype',
 'ExcelFile',
 'ExcelWriter',
 'Float64Index',
 'Grouper',
 'HDFStore',
 'Index',
 'IndexSlice',
 'Int16Dtype',
 'Int32Dtype',
 'Int64Dtype',
 'Int64Index',
 'Int8Dtype',
 'Interval',
 'IntervalDtype',
 'IntervalIndex',
 'MultiIndex',
 'NA',
 'NaT',
 'NamedAgg',
 'Period',
 'PeriodDtype',
 'PeriodIndex',
 'RangeIndex',
 'Series',
 'SparseDtype',
 'StringDtype',
 'Timedelta',
 'TimedeltaIndex',
 'Timestamp',
 'UInt16Dtype',
 'UInt32Dtype',
 'UInt64Dtype',
 'UInt64Index',
 'UInt8Dtype',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__docformat__',
 '__file__',
 '__getattr__',
 '__git_version__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_config',
 '_hashtable',
 '_is_numpy_dev',
 '_lib',
 '_libs',
 '_np_version_under1p16',
 '_np_version_under1p17',
 '_np_version_under1p18',
 '_testing',
 '_tslib',
 '_ty

### Introduction to pandas data structure: **Series**
A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called the **index**. pandas Index objects:
- hold the axis labels and other metadata
- are immutable
- allow duplicate labels
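
The last two properties can be checked directly. The following is a minimal sketch; the variable names (`labels`, `mutated`, `has_duplicates`) are illustrative and do not come from the original notebook.

```python
import pandas as pd

# Duplicate labels are accepted when building an Index.
labels = pd.Index(["M", "A", "T", "T"])

# Index objects are immutable: item assignment raises a TypeError.
try:
    labels[0] = "X"
    mutated = True
except TypeError:
    mutated = False

has_duplicates = not labels.is_unique
print(mutated, has_duplicates)
```

Immutability is what makes it safe to share one Index between several Series or DataFrames.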

Another way to think about a Series is as a fixed-length, ordered dictionary, as it is a mapping of index values to data values.

The index is written on the left and the values on the right. If we don't specify values for the index, the default one consists of the integers 0 to N-1.

In [2]:
obj = pd.Series([4, "Mathilde", 34, 23, True])

In [35]:
obj.head(2)

0           4
1    Mathilde
dtype: object

In [4]:
obj.values

array([4, 'Mathilde', 34, 23, True], dtype=object)

In [19]:
obj.index

RangeIndex(start=0, stop=5, step=1)

In [23]:
obj2 = pd.Series([46, "Matt", 1987, "Nebraska"], index = ["M", "A", "T", "T"])

In [7]:
obj2

M          46
A        Matt
T        1987
T    Nebraska
dtype: object

In [8]:
obj2.values

array([46, 'Matt', 1987, 'Nebraska'], dtype=object)

In [73]:
obj2.index

Index(['M', 'A', 'T', 'T'], dtype='object')

## Series indexing

In [82]:
obj[2]

34

In [79]:
obj[[0, 2, 3]]

0     4
2    34
3    23
dtype: object

In [10]:
obj2[0]

46

In [11]:
obj2[0:2]

M      46
A    Matt
dtype: object

In [81]:
obj2["T"]

T        1987
T    Nebraska
dtype: object
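
The lookups above rely on plain `[]`, which accepts both positions and labels. A more explicit alternative, sketched here under the assumption that `obj2` is rebuilt as defined earlier, is to use `.iloc` for positions and `.loc` for labels. Newer pandas releases may warn when an integer is used positionally with `[]` on a Series that has a non-integer index.

```python
import pandas as pd

obj2 = pd.Series([46, "Matt", 1987, "Nebraska"], index=["M", "A", "T", "T"])

first_by_position = obj2.iloc[0]   # element at position 0
first_by_label = obj2.loc["M"]     # element labelled "M"
both_t = obj2.loc["T"]             # a duplicated label returns a Series

print(first_by_position, first_by_label)
print(both_t)
```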

In [12]:
obj2*2

M                  92
A            MattMatt
T                3974
T    NebraskaNebraska
dtype: object

In [13]:
#Find out if "M" is in the index of obj
"M" in obj

False

In [14]:
"M" in obj2

True
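
The `in` operator tests the index labels, not the values, in the same way that `in` tests the keys of a dictionary. A small sketch of the difference, with illustrative variable names:

```python
import pandas as pd

obj2 = pd.Series([46, "Matt", 1987, "Nebraska"], index=["M", "A", "T", "T"])

label_found = "M" in obj2                 # "M" is an index label
value_as_label = "Matt" in obj2           # "Matt" is a value, not a label
value_found = "Matt" in obj2.tolist()     # search the values explicitly

print(label_found, value_as_label, value_found)
```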

### Change index in a Series

In [15]:
# Change index in obj2
obj2.index = ["MA", "CT", "NY", "NH"]
obj2

MA          46
CT        Matt
NY        1987
NH    Nebraska
dtype: object

### Convert dictionaries into Series
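When a dictionary is converted into a Series, its keys become the index. Recent pandas versions keep the keys in the dictionary's insertion order, as the output below shows; older versions sorted them.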

In [16]:
sdata = {"Nebraska": "Mid-West", "Massachusetts": "New England", "California": "West Coast", "Florida": "South"}

In [17]:
obj3 = pd.Series(sdata)

In [18]:
obj3

Nebraska            Mid-West
Massachusetts    New England
California        West Coast
Florida                South
dtype: object
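
To control the order of the labels, an explicit `index` can be passed along with the dictionary. The sketch below uses a shortened, hypothetical dictionary and a `states` list that are not part of the original notebook. A label without a matching key ends up as a missing value.

```python
import pandas as pd

regions = {"Nebraska": "Mid-West", "Florida": "South"}
states = ["Florida", "Texas", "Nebraska"]   # "Texas" is not a key of the dictionary

# The index argument fixes the order; unknown labels get NaN.
obj4 = pd.Series(regions, index=states)
print(obj4)
```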

### Introduction to pandas data structure: **DataFrame**
A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (e.g. numeric, string, boolean).

A DataFrame has both a row and a column index. It can be thought of as a dictionary of Series all sharing the same index.
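
The "dictionary of Series" view can be made concrete with a small sketch. The two Series and their labels below are made up for illustration.

```python
import pandas as pd

# Two Series that share the same index labels.
quantity = pd.Series([5, 23], index=["Bananas", "Raspberries"])
price = pd.Series([0.49, 5.0], index=["Bananas", "Raspberries"])

# Each dictionary key becomes a column; the shared labels become the row index.
table = pd.DataFrame({"Quantity": quantity, "Price": price})

# Selecting a column gives back a Series carrying the row index.
price_column = table["Price"]
print(table)
```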

### Create a dataframe
One way to construct a DataFrame is from a dictionary of equal-length lists. The resulting DataFrame has its index assigned automatically, as with Series. Recent pandas versions keep the columns in the dictionary's insertion order; older versions placed them in sorted order.

In [None]:
dataframe = pd.DataFrame(
{
    "Fruits": ["Bananas", "Raspberries", "Strawberries", "Oranges", "Apples", "Pears", "Apples"],
    "Color": ["Yellow", "Pink", "Red", "Orange", "Red", "Red", "Green"],
    "Quantity": [5, 23, 59, 12, 3, 6, 10],
    "Price": [0.49, 5, 4, 1.5, 1.79, 1.99, 1.99],
    "Good?": [False, True, True, True, True, True, False]
})

## Visualize the dataframe
Evaluating the DataFrame in a cell displays the whole table; `head(n)` shows only the first `n` rows.


In [None]:
dataframe

In [None]:
dataframe.head(3)

***Note:*** To arrange the columns in a specific order, or to keep only some of them, pass a list of column names through the `columns` argument. A name in this list that is not a key of the dictionary produces a column of missing values.

In [39]:
dataframe = pd.DataFrame(
{
    "Fruits": ["Bananas", "Raspberries", "Strawberries", "Oranges", "Apples", "Pears", "Apples"],
    "Color": ["Yellow", "Pink", "Red", "Orange", "Red", "Red", "Green"],
    "Quantity": [5, 23, 59, 12, 3, 6, 10],
    "Price": [0.49, 5, 4, 1.5, 1.79, 1.99, 1.99],
    "Good?": [False, True, True, True, True, True, False]
}, columns = ["Fruits","Quantity", "Price", "Color", "Good?", "No value"])

In [None]:
dataframe

## Get initial information on this table

In [None]:
dataframe.info()

In [None]:
dataframe.describe()

In [None]:
dataframe.columns

In [None]:
dataframe.index

In [None]:
dataframe.values

## Create a new column

In [40]:
dataframe["InSeason"] = [True, False, True, True, True, True, True]

In [41]:
dataframe

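| | Fruits | Quantity | Price | Color | Good? | No value | InSeason |
|---|---|---|---|---|---|---|---|
| 0 | Bananas | 5 | 0.49 | Yellow | False | NaN | True |
| 1 | Raspberries | 23 | 5.0 | Pink | True | NaN | False |
| 2 | Strawberries | 59 | 4.0 | Red | True | NaN | True |
| 3 | Oranges | 12 | 1.5 | Orange | True | NaN | True |
| 4 | Apples | 3 | 1.79 | Red | True | NaN | True |
| 5 | Pears | 6 | 1.99 | Red | True | NaN | True |
| 6 | Apples | 10 | 1.99 | Green | False | NaN | True |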


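### Assigning a list, array, or Series to a column
A list or array must have the same length as the DataFrame. A Series is aligned on the index instead: index labels of the DataFrame that are missing from the Series get a missing value in the column.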

In [42]:
value = pd.Series([1, 2, 3, 4], index = [1, 3, 5, 0])

In [43]:
dataframe["No value"] = value

In [44]:
dataframe

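| | Fruits | Quantity | Price | Color | Good? | No value | InSeason |
|---|---|---|---|---|---|---|---|
| 0 | Bananas | 5 | 0.49 | Yellow | False | 4.0 | True |
| 1 | Raspberries | 23 | 5.0 | Pink | True | 1.0 | False |
| 2 | Strawberries | 59 | 4.0 | Red | True | NaN | True |
| 3 | Oranges | 12 | 1.5 | Orange | True | 2.0 | True |
| 4 | Apples | 3 | 1.79 | Red | True | NaN | True |
| 5 | Pears | 6 | 1.99 | Red | True | 3.0 | True |
| 6 | Apples | 10 | 1.99 | Green | False | NaN | True |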
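
Unlike a Series, a plain list is not aligned on the index, so its length has to match the number of rows. The sketch below uses a small, made-up DataFrame to show both the accepted and the rejected case.

```python
import pandas as pd

fruits = pd.DataFrame({"Fruits": ["Bananas", "Pears", "Apples"]})

# A list with one value per row is accepted.
fruits["Quantity"] = [5, 6, 10]

# A list of the wrong length raises a ValueError.
try:
    fruits["Price"] = [0.49, 1.99]
    accepted = True
except ValueError:
    accepted = False

print(accepted)
print(fruits)
```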


## Look at specific columns
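A column of a DataFrame can be retrieved as a Series with either attribute or dictionary-like notation. Attribute access only works when the column name is a valid Python identifier, so columns such as `Good?` or `No value` require the dictionary-like notation.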

In [None]:
#Method 1: Use attribute
dataframe.Fruits

In [None]:
#Method 2: Use dictionary-like notations
dataframe["Fruits"]

In [None]:
# Method 3: pass a list of names to select more than one column
dataframe[["Fruits", "Quantity"]]

## Look at specific rows

In [None]:
dataframe_red = dataframe.loc[dataframe["Color"] == "Red"]

In [None]:
dataframe_red

In [None]:
dataframe_red.info()

In [None]:
dataframe_cheap = dataframe.loc[dataframe["Price"] < 2]

In [None]:
dataframe_cheap
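
The two filters above can also be combined into a single selection. This sketch rebuilds only the columns it needs from the fruit table; the name `cheap_and_red` is illustrative.

```python
import pandas as pd

fruits = pd.DataFrame({
    "Fruits": ["Bananas", "Raspberries", "Strawberries", "Oranges", "Apples", "Pears", "Apples"],
    "Color": ["Yellow", "Pink", "Red", "Orange", "Red", "Red", "Green"],
    "Price": [0.49, 5, 4, 1.5, 1.79, 1.99, 1.99],
})

# Wrap each condition in parentheses; & means "and", | means "or".
cheap_and_red = fruits.loc[(fruits["Color"] == "Red") & (fruits["Price"] < 2)]
print(cheap_and_red)
```

The parentheses are needed because `&` binds more tightly than `==` and `<` in Python.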

In [None]:
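# Using the index labels
# Beware: .loc slices include the end label, so this returns rows 0, 1 and 2
dataframe.loc[0:2]

In [None]:
# .iloc slices by position and excludes the end, so this returns rows 1 to 4
dataframe.iloc[1:5]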
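
The difference between the two kinds of slicing is easy to miss when the index is the default 0 to N-1. A minimal sketch on a made-up one-column DataFrame:

```python
import pandas as pd

numbers = pd.DataFrame({"Quantity": [5, 23, 59, 12, 3]})

by_label = numbers.loc[0:2]      # labels 0, 1 and 2: the end label is included
by_position = numbers.iloc[0:2]  # positions 0 and 1: the end position is excluded

print(len(by_label), len(by_position))
```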

## Remove columns we don't need

In [None]:
dataframe.info()

In [None]:
# Method 1: use the del statement
del dataframe["No value"]

In [None]:
dataframe

In [65]:
#Method 2: use the drop method
# Axis = 1 for columns
dataframe = dataframe.drop(["Good?", "InSeason"], axis = 1)

In [66]:
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Fruits    7 non-null      object 
 1   Quantity  7 non-null      int64  
 2   Price     7 non-null      float64
 3   Color     7 non-null      object 
 4   No value  4 non-null      float64
dtypes: float64(2), int64(1), object(2)
memory usage: 408.0+ bytes
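
As an alternative to `axis = 1`, `drop` also accepts a `columns` keyword, which some readers find easier to read. The sketch below uses a reduced, made-up version of the fruit table and checks that both spellings give the same result.

```python
import pandas as pd

fruits = pd.DataFrame({
    "Fruits": ["Bananas", "Pears"],
    "Good?": [False, True],
    "InSeason": [True, True],
})

with_keyword = fruits.drop(columns=["Good?", "InSeason"])
with_axis = fruits.drop(["Good?", "InSeason"], axis=1)

print(with_keyword.equals(with_axis))
```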


## Remove rows we don't need
Pass the index labels of the unwanted rows to `drop`. It returns a new DataFrame and leaves the original unchanged.

In [70]:
dataframe.drop([4, 6])

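| | Fruits | Quantity | Price | Color | No value |
|---|---|---|---|---|---|
| 0 | Bananas | 5 | 0.49 | Yellow | 4.0 |
| 1 | Raspberries | 23 | 5.0 | Pink | 1.0 |
| 2 | Strawberries | 59 | 4.0 | Red | NaN |
| 3 | Oranges | 12 | 1.5 | Orange | 2.0 |
| 5 | Pears | 6 | 1.99 | Red | 3.0 |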
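
Because the call above is not assigned to anything, `dataframe` itself still contains rows 4 and 6. A short sketch on a made-up table shows that the result must be stored to keep it:

```python
import pandas as pd

fruits = pd.DataFrame({"Fruits": ["Bananas", "Apples", "Pears"]})

# drop returns a new object; the original keeps all of its rows.
trimmed = fruits.drop([1])

print(len(fruits), list(trimmed.index))
```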


## Rename column heads

In [None]:
dataframe = dataframe.rename(columns={"Price":"Price per pound", "Color":"Color when ripe"})

In [None]:
dataframe

### Sort data

In [None]:
# "Price" was renamed to "Price per pound" above, so the new name is used here
dataframe.sort_values(by=["Price per pound", "Quantity"], ascending=False)
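
The `ascending` argument also accepts one flag per sort column. The sketch below uses a small, made-up table with a plain `Price` column: it sorts by increasing price and breaks ties by decreasing quantity.

```python
import pandas as pd

fruits = pd.DataFrame({
    "Fruits": ["Pears", "Bananas", "Apples"],
    "Price": [1.99, 0.49, 1.99],
    "Quantity": [6, 5, 10],
})

# One boolean per column in "by": Price ascending, Quantity descending.
ordered = fruits.sort_values(by=["Price", "Quantity"], ascending=[True, False])
print(ordered)
```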

## Transpose the DataFrame (i.e. swap rows and columns)

In [None]:
dataframe

In [None]:
dataframe.T

## Set **name** attributes for columns and indexes

In [None]:
dataframe

In [None]:
dataframe.index.name = "number"
dataframe.columns.name = "Characteristics"

In [None]:
dataframe.values

### Save the selection in a csv file
CSV stands for comma-separated values.

In [None]:
dataframe.to_csv("fruits.csv")
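
By default `to_csv` also writes the index as the first column, without a header. The sketch below writes to an in-memory buffer instead of a file (an assumption made to keep the example self-contained) and shows how `index_col=0` restores the index when reading back. Without it, the index comes back as an ordinary column named `Unnamed: 0`.

```python
import io

import pandas as pd

fruits = pd.DataFrame({"Fruits": ["Bananas", "Pears"], "Quantity": [5, 6]})

# Write the CSV text to an in-memory buffer rather than to disk.
buffer = io.StringIO()
fruits.to_csv(buffer)

# Reading without index_col keeps the saved index as a regular column.
buffer.seek(0)
naive = pd.read_csv(buffer)

# index_col=0 turns the first column back into the index.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(naive.columns))
print(list(restored.columns))
```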

## To go further
### reindex

In [45]:
myseries = pd.Series([1,2,3,4,5,6,7], index = ['a', 'b', 'c', 'd', 'e', 'f', 'g'])

In [46]:
myseries

a    1
b    2
c    3
d    4
e    5
f    6
g    7
dtype: int64

In [53]:
myseries2 = myseries.reindex(['b', 'c', 'a', 'd', 'v', 'u'])

In [54]:
myseries2

b    2.0
c    3.0
a    1.0
d    4.0
v    NaN
u    NaN
dtype: float64
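
The values became floats because `NaN` cannot be stored in an integer column. If a default value makes sense for the new labels, `reindex` accepts a `fill_value` argument. A minimal sketch with a shortened version of `myseries`:

```python
import pandas as pd

myseries = pd.Series([1, 2, 3], index=["a", "b", "c"])

# Labels absent from the original Series receive 0 instead of NaN.
filled = myseries.reindex(["b", "c", "v"], fill_value=0)
print(filled)
```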

In [55]:
dataframe

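| | Fruits | Quantity | Price | Color | Good? | No value | InSeason |
|---|---|---|---|---|---|---|---|
| 0 | Bananas | 5 | 0.49 | Yellow | False | 4.0 | True |
| 1 | Raspberries | 23 | 5.0 | Pink | True | 1.0 | False |
| 2 | Strawberries | 59 | 4.0 | Red | True | NaN | True |
| 3 | Oranges | 12 | 1.5 | Orange | True | 2.0 | True |
| 4 | Apples | 3 | 1.79 | Red | True | NaN | True |
| 5 | Pears | 6 | 1.99 | Red | True | 3.0 | True |
| 6 | Apples | 10 | 1.99 | Green | False | NaN | True |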


In [60]:
dataframe2 = dataframe.reindex([0, 1, 1, 2, 4, 7])

In [61]:
dataframe2

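| | Fruits | Quantity | Price | Color | Good? | No value | InSeason |
|---|---|---|---|---|---|---|---|
| 0 | Bananas | 5.0 | 0.49 | Yellow | False | 4.0 | True |
| 1 | Raspberries | 23.0 | 5.0 | Pink | True | 1.0 | False |
| 1 | Raspberries | 23.0 | 5.0 | Pink | True | 1.0 | False |
| 2 | Strawberries | 59.0 | 4.0 | Red | True | NaN | True |
| 4 | Apples | 3.0 | 1.79 | Red | True | NaN | True |
| 7 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |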


In [64]:
newcolumns = ['Fruits', 'Color', 'InSeason', 'Name']
dataframe2.reindex(columns = newcolumns)

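| | Fruits | Color | InSeason | Name |
|---|---|---|---|---|
| 0 | Bananas | Yellow | True | NaN |
| 1 | Raspberries | Pink | False | NaN |
| 1 | Raspberries | Pink | False | NaN |
| 2 | Strawberries | Red | True | NaN |
| 4 | Apples | Red | True | NaN |
| 7 | NaN | NaN | NaN | NaN |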
