
# Introduction to Pandas

## Pandas provides Python data frames

* Popular and established
* Inspired by R dataframes
* Built on `numpy` for fast computation

In [2]:
import pandas as pd

## Our first dataframe

In [3]:
df = pd.DataFrame({"Names": ["Iverson", "Malone", "Bergen"],
                   "Python_mastery": [10, 5, 1.0],
                   "Love_of_R": [2, 5, 11],
                   "years_at_wsu": [4, 17, 5]})
df.head()

Unnamed: 0,Names,Python_mastery,Love_of_R,years_at_wsu
0,Iverson,10.0,2,4
1,Malone,5.0,5,17
2,Bergen,1.0,11,5


## Reading from a csv

* Most data sets will be read in from a csv or JSON data file
* `Pandas` provides `read_csv` and `read_json`
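As a minimal sketch of these two readers, the cell below parses a tiny CSV string and a tiny JSON string held in memory, so it needs no files or network access. The sample rows and the names `csv_text`, `json_text`, `from_csv`, and `from_json` are assumptions made for illustration; they are not part of the MoMA files.

```python
import io
import pandas as pd

# Hypothetical two-row sample, typed inline for illustration only.
csv_text = (
    "ConstituentID,DisplayName,Nationality\n"
    "1,Robert Arneson,American\n"
    "2,Doroteo Arnaiz,Spanish\n"
)
json_text = (
    '[{"ConstituentID": 1, "DisplayName": "Robert Arneson", "Nationality": "American"},'
    ' {"ConstituentID": 2, "DisplayName": "Doroteo Arnaiz", "Nationality": "Spanish"}]'
)

# Both readers accept a file-like object as well as a path or URL.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```

Both calls should produce a dataframe with the same two rows and three columns.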

### Open a local file w/ relative path

In [4]:
# Won't work in colab
artists = pd.read_csv('./data/Artists.csv')
artists.head()

Unnamed: 0,ConstituentID,DisplayName,ArtistBio,Nationality,Gender,BeginDate,EndDate,Wiki QID,ULAN
0,1,Robert Arneson,"American, 1930–1992",American,Male,1930,1992,,
1,2,Doroteo Arnaiz,"Spanish, born 1936",Spanish,Male,1936,0,,
2,3,Bill Arnold,"American, born 1941",American,Male,1941,0,,
3,4,Charles Arnoldi,"American, born 1946",American,Male,1946,0,Q1063584,500027998.0
4,5,Per Arnoldi,"Danish, born 1941",Danish,Male,1941,0,,


### Open a web address

In [5]:
url = "https://github.com/MuseumofModernArt/collection/raw/master/Artists.csv"
artists = pd.read_csv(url)
artists.head()

Unnamed: 0,ConstituentID,DisplayName,ArtistBio,Nationality,Gender,BeginDate,EndDate,Wiki QID,ULAN
0,1,Robert Arneson,"American, 1930–1992",American,Male,1930,1992,,
1,2,Doroteo Arnaiz,"Spanish, born 1936",Spanish,Male,1936,0,,
2,3,Bill Arnold,"American, born 1941",American,Male,1941,0,,
3,4,Charles Arnoldi,"American, born 1946",American,Male,1946,0,Q1063584,500027998.0
4,5,Per Arnoldi,"Danish, born 1941",Danish,Male,1941,0,,


# JSON data file

* Another (more modern) storage format
* Here the data is stored as a list of row `dict`s

```{json}
[
{
  "ConstituentID": 1,
  "DisplayName": "Robert Arneson",
  "ArtistBio": "American, 1930–1992",
  "Nationality": "American",
  "Gender": "Male",
  "BeginDate": 1930,
  "EndDate": 1992,
  "Wiki QID": null,
  "ULAN": null
},
{
  "ConstituentID": 2,
  "DisplayName": "Doroteo Arnaiz",
  "ArtistBio": "Spanish, born 1936",
  "Nationality": "Spanish",
  "Gender": "Male",
  "BeginDate": 1936,
  "EndDate": 0,
  "Wiki QID": null,
  "ULAN": null
},
...
```
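To see why a list of row `dict`s maps so naturally onto a dataframe, here is a small sketch that parses two abbreviated records with the standard `json` module and hands the resulting list to `pd.DataFrame`. The shortened records and the names `json_text`, `records`, and `rows_df` are assumptions for illustration.

```python
import json
import pandas as pd

# Two abbreviated records in the same row-dict layout (illustrative subset of keys).
json_text = """
[
  {"ConstituentID": 1, "DisplayName": "Robert Arneson", "Nationality": "American", "Wiki QID": null},
  {"ConstituentID": 2, "DisplayName": "Doroteo Arnaiz", "Nationality": "Spanish", "Wiki QID": null}
]
"""

records = json.loads(json_text)   # a Python list of dicts, one dict per row
rows_df = pd.DataFrame(records)   # keys become columns, each dict becomes a row

print(type(records[0]))
print(rows_df)
```

Each key becomes a column name, and JSON `null` values show up as missing values in the dataframe.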

## `pandas` can read `json` data

In [6]:
# Won't work in colab
artists = pd.read_json('./data/Artists.json')
artists.head()

Unnamed: 0,ConstituentID,DisplayName,ArtistBio,Nationality,Gender,BeginDate,EndDate,Wiki QID,ULAN
0,1,Robert Arneson,"American, 1930–1992",American,Male,1930,1992,,
1,2,Doroteo Arnaiz,"Spanish, born 1936",Spanish,Male,1936,0,,
2,3,Bill Arnold,"American, born 1941",American,Male,1941,0,,
3,4,Charles Arnoldi,"American, born 1946",American,Male,1946,0,Q1063584,500027998.0
4,5,Per Arnoldi,"Danish, born 1941",Danish,Male,1941,0,,


In [7]:
json_url = "https://github.com/MuseumofModernArt/collection/raw/master/Artists.json"
artists = pd.read_json(json_url)
artists.head()

Unnamed: 0,ConstituentID,DisplayName,ArtistBio,Nationality,Gender,BeginDate,EndDate,Wiki QID,ULAN
0,1,Robert Arneson,"American, 1930–1992",American,Male,1930,1992,,
1,2,Doroteo Arnaiz,"Spanish, born 1936",Spanish,Male,1936,0,,
2,3,Bill Arnold,"American, born 1941",American,Male,1941,0,,
3,4,Charles Arnoldi,"American, born 1946",American,Male,1946,0,Q1063584,500027998.0
4,5,Per Arnoldi,"Danish, born 1941",Danish,Male,1941,0,,


## <font color="red"> Exercise 2.1.1 </font>
    
Use tab-completion and `help` to discover and explore two more methods of reading a file into a `Pandas` dataframe.


In [11]:
help(pd.read_excel)

Help on function read_excel in module pandas.io.excel._base:

read_excel(io, sheet_name: 'str | int | list[IntStrT] | None' = 0, header: 'int | Sequence[int] | None' = 0, names=None, index_col: 'int | Sequence[int] | None' = None, usecols=None, squeeze: 'bool | None' = None, dtype: 'DtypeArg | None' = None, engine: "Literal['xlrd', 'openpyxl', 'odf', 'pyxlsb'] | None" = None, converters=None, true_values: 'Iterable[Hashable] | None' = None, false_values: 'Iterable[Hashable] | None' = None, skiprows: 'Sequence[int] | int | Callable[[int], object] | None' = None, nrows: 'int | None' = None, na_values=None, keep_default_na: 'bool' = True, na_filter: 'bool' = True, verbose: 'bool' = False, parse_dates=False, date_parser=None, thousands: 'str | None' = None, decimal: 'str' = '.', comment: 'str | None' = None, skipfooter: 'int' = 0, convert_float: 'bool | None' = None, mangle_dupe_cols: 'bool' = True, storage_options: 'StorageOptions' = None) -> 'DataFrame | dict[IntStrT, DataFrame]'
    Rea

In [12]:
help(pd.read_html)

Help on function read_html in module pandas.io.html:

read_html(io: 'FilePath | ReadBuffer[str]', match: 'str | Pattern' = '.+', flavor: 'str | None' = None, header: 'int | Sequence[int] | None' = None, index_col: 'int | Sequence[int] | None' = None, skiprows: 'int | Sequence[int] | slice | None' = None, attrs: 'dict[str, str] | None' = None, parse_dates: 'bool' = False, thousands: 'str | None' = ',', encoding: 'str | None' = None, decimal: 'str' = '.', converters: 'dict | None' = None, na_values=None, keep_default_na: 'bool' = True, displayed_only: 'bool' = True) -> 'list[DataFrame]'
    Read HTML tables into a ``list`` of ``DataFrame`` objects.
    
    Parameters
    ----------
    io : str, path object, or file-like object
        String, path object (implementing ``os.PathLike[str]``), or file-like
        object implementing a string ``read()`` function.
        The string can represent a URL or the HTML itself. Note that
        lxml only accepts the http, ftp and file url proto

> We found that pandas does not support every file type, but it supports plenty for a data scientist to work with. It can read Excel sheets and HTML tables, which could be useful for web scraping.
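Besides tab-completion, one rough way to see which readers exist is to list the top-level `pandas` names that start with `read_`. This is only a sketch: the exact list depends on the installed `pandas` version, and `readers` is a name chosen here for illustration.

```python
import pandas as pd

# Collect every top-level pandas attribute whose name starts with "read_".
readers = sorted(name for name in dir(pd) if name.startswith("read_"))

print(len(readers))
print(readers)
```

Any name in the list can then be passed to `help`, for example `help(pd.read_excel)`.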

## <font color="red"> Exercise 2.1.2 </font>
    
Read in the `Artworks.csv` from [https://github.com/MuseumofModernArt/collection](https://github.com/MuseumofModernArt/collection) and display the head of the resulting dataframe.


In [13]:
url = "https://github.com/MuseumofModernArt/collection/raw/master/Artworks.csv"
artwork = pd.read_csv(url)
artwork.head()

Unnamed: 0,Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,Medium,...,ThumbnailURL,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Seat Height (cm),Duration (sec.)
0,"Ferdinandsbrücke Project, Vienna, Austria (Ele...",Otto Wagner,6210,"(Austrian, 1841–1918)",(Austrian),(1841),(1918),(Male),1896,Ink and cut-and-pasted painted pages on paper,...,http://www.moma.org/media/W1siZiIsIjU5NDA1Il0s...,,,,48.6,,,168.9,,
1,"City of Music, National Superior Conservatory ...",Christian de Portzamparc,7470,"(French, born 1944)",(French),(1944),(0),(Male),1987,Paint and colored pencil on print,...,http://www.moma.org/media/W1siZiIsIjk3Il0sWyJw...,,,,40.6401,,,29.8451,,
2,"Villa near Vienna Project, Outside Vienna, Aus...",Emil Hoppe,7605,"(Austrian, 1876–1957)",(Austrian),(1876),(1957),(Male),1903,"Graphite, pen, color pencil, ink, and gouache ...",...,http://www.moma.org/media/W1siZiIsIjk4Il0sWyJw...,,,,34.3,,,31.8,,
3,"The Manhattan Transcripts Project, New York, N...",Bernard Tschumi,7056,"(French and Swiss, born Switzerland 1944)",(),(1944),(0),(Male),1980,Photographic reproduction with colored synthet...,...,http://www.moma.org/media/W1siZiIsIjEyNCJdLFsi...,,,,50.8,,,50.8,,
4,"Villa, project, outside Vienna, Austria, Exter...",Emil Hoppe,7605,"(Austrian, 1876–1957)",(Austrian),(1876),(1957),(Male),1903,"Graphite, color pencil, ink, and gouache on tr...",...,http://www.moma.org/media/W1siZiIsIjEyNiJdLFsi...,,,,38.4,,,19.1,,


## So what is a `DataFrame`?

* Like R, Pandas focuses on columns
* Think `dict` of `(str, Series)` pairs 
* A series is a typed list-like structure

In [14]:
# This is how I imagine a dataframe
df = pd.DataFrame({"Names": ["Iverson", "Malone", "Bergen"],
                   "Python_mastery": [10, 5, 1.0],
                   "years_at_wsu": [4.5, 17.5, 5.5]})

In [15]:
type(df)

pandas.core.frame.DataFrame
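To make the "`dict` of `(str, Series)` pairs" picture concrete, the sketch below builds a dataframe from explicit `Series` objects and then walks over the `(name, column)` pairs with `items()`. The name `df_sketch` is an assumption used only here.

```python
import pandas as pd

# Build each column as its own typed Series first.
names = pd.Series(["Iverson", "Malone", "Bergen"])
years = pd.Series([4.5, 17.5, 5.5])

# A dataframe can be assembled from a dict that maps column names to Series.
df_sketch = pd.DataFrame({"Names": names, "years_at_wsu": years})

# items() yields (column name, Series) pairs, much like dict.items().
for col_name, column in df_sketch.items():
    print(col_name, type(column).__name__, column.dtype)
```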

## Columns are `Series` and hold one type of data

In [16]:
type(artists.BeginDate), type(artists.DisplayName)

(pandas.core.series.Series, pandas.core.series.Series)

In [17]:
artists.BeginDate.dtype, artists.DisplayName.dtype

(dtype('int64'), dtype('O'))

## Two ways to access a column

* **Method 1:** like a dictionary
    * `df["column_name"]`
* **Method 2:** like an object attribute
    * `df.column_name`
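    * Only works when the name is a valid Python identifier (no spaces or punctuation) and does not clash with an existing `DataFrame` attribute or method!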

In [18]:
artists.BeginDate.head(2)

0    1930
1    1936
Name: BeginDate, dtype: int64

In [19]:
artists['BeginDate'].head(2)

0    1930
1    1936
Name: BeginDate, dtype: int64
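The sketch below illustrates where attribute access breaks down, using a made-up two-column dataframe (`access_demo` and its values are assumptions for illustration). A column called `Wiki QID` contains a space, so `access_demo.Wiki QID` would be a syntax error, and a column called `count` is shadowed by the `DataFrame.count` method.

```python
import pandas as pd

# Hypothetical frame with two awkward column names.
access_demo = pd.DataFrame({"Wiki QID": ["Q1063584"], "count": [1]})

# Bracket access works for any column name.
print(access_demo["Wiki QID"])
print(access_demo["count"])

# Attribute access finds the DataFrame.count method, not the "count" column.
print(callable(access_demo.count))
```

Bracket access is therefore the safer default; attribute access is a convenience for well-behaved names.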

## More on data types

* See all data types with `df.dtypes`
* You can set the `dtypes` when you read a dataframe
* Read more about types: [Pandas docs](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dtypes.html)

In [20]:
artists.dtypes

ConstituentID      int64
DisplayName       object
ArtistBio         object
Nationality       object
Gender            object
BeginDate          int64
EndDate            int64
Wiki QID          object
ULAN             float64
dtype: object

## Setting `dtypes` with `read_csv`

We can pass a `dict` of types to the `dtype` keyword

In [24]:
import numpy as np
artist_types = {'ConstituentID': np.int64,
                'DisplayName': str,
                'ArtistBio': str,
                'Nationality': str,
                'Gender': str,
                'BeginDate': np.int64,
                'EndDate': np.int64,
                'Wiki QID': str,
                'ULAN': pd.Int64Dtype()}  # If you get an error ==> update pandas (see below)
artists2 = pd.read_csv('./data/Artists.csv', dtype=artist_types)
artists2.head()

Unnamed: 0,ConstituentID,DisplayName,ArtistBio,Nationality,Gender,BeginDate,EndDate,Wiki QID,ULAN
0,1,Robert Arneson,"American, 1930–1992",American,Male,1930,1992,,
1,2,Doroteo Arnaiz,"Spanish, born 1936",Spanish,Male,1936,0,,
2,3,Bill Arnold,"American, born 1941",American,Male,1941,0,,
3,4,Charles Arnoldi,"American, born 1946",American,Male,1946,0,Q1063584,500027998.0
4,5,Per Arnoldi,"Danish, born 1941",Danish,Male,1941,0,,


## What's up with `ULAN` ?

* Currently, `numpy` $\rightarrow$ no missing `int`s
* Pandas corrects this with the `pd.Int64Dtype()` type
    * Only available in `pandas >= 0.24.0`

In [25]:
pd.__version__

'1.4.3'

## An `Int` by any other name ...

* `np.int64` $\rightarrow$ no missing values
* `pd.Int64Dtype()` $\rightarrow$ allows missing values (shown as `<NA>`)

In [26]:
artists2.dtypes

ConstituentID     int64
DisplayName      object
ArtistBio        object
Nationality      object
Gender           object
BeginDate         int64
EndDate           int64
Wiki QID         object
ULAN              Int64
dtype: object
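The difference between the two integer types is easiest to see on a column with a gap. The sketch below reads a tiny in-memory CSV twice, once with default type inference and once with `pd.Int64Dtype()` for `ULAN`. The two-row sample and the names `csv_text`, `default_read`, and `typed_read` are assumptions for illustration.

```python
import io
import pandas as pd

# Hypothetical sample: the second row has no ULAN value.
csv_text = "ConstituentID,ULAN\n4,500027998\n5,\n"

# Default inference falls back to float64 so the gap can be stored as NaN.
default_read = pd.read_csv(io.StringIO(csv_text))

# The nullable Int64 dtype keeps integers and marks the gap as missing.
typed_read = pd.read_csv(io.StringIO(csv_text), dtype={"ULAN": pd.Int64Dtype()})

print(default_read["ULAN"].dtype)
print(typed_read["ULAN"].dtype)
print(typed_read)
```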

## Preview of coming attractions

* Now we can switch `BeginDate` and `EndDate` from `0` to `np.nan`
* We will do this in the next section
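As a rough preview only, and not necessarily the approach the next section takes, zeros in a date column could be turned into missing values with `Series.mask`. The sample values and the names `end_dates`, `cleaned`, and `cleaned_int` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical EndDate values, where 0 stands for "no end date recorded".
end_dates = pd.Series([1992, 0, 0], name="EndDate")

# mask() replaces entries where the condition is True with a missing value;
# a plain int64 column is upcast to float64 so that it can hold NaN.
cleaned = end_dates.mask(end_dates == 0)

# Converting to the nullable Int64 dtype restores integers and keeps the gaps.
cleaned_int = cleaned.astype("Int64")

print(cleaned)
print(cleaned_int)
```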

# Getting to know your data

## Basic inspection tools

* `df.head()`        first five rows
* `df.tail()`        last five rows
* `df.sample(5)`     random sample of five rows
* `df.shape`         `(rows, columns)` as a tuple
* `df.describe()`    summary statistics (count, mean, std, min, quartiles, max) for numeric columns
* `df.info()`        column names, non-null counts, dtypes, and memory usage
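Here is a small sketch of these tools on a made-up four-row dataframe, so the output is short enough to read at a glance. The frame `df_small` and its values are assumptions for illustration and are unrelated to the MoMA data.

```python
import pandas as pd

# Hypothetical frame with one text column, one numeric column, and some gaps.
df_small = pd.DataFrame({
    "Title": ["A", "B", "C", None],
    "Height (cm)": [48.6, 40.6, None, 34.3],
})

print(df_small.shape)                      # (rows, columns)
print(df_small.head(2))                    # first two rows
print(df_small.tail(2))                    # last two rows
print(df_small.sample(2, random_state=0))  # two random rows, seeded for repeatability
print(df_small.describe())                 # statistics for the numeric column only
df_small.info()                            # prints dtypes and non-null counts
```

Passing a number to `head`, `tail`, or `sample` changes how many rows come back.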

## <font color="red"> Exercise 1: Inspect the artwork from MoMA </font>

#### Read the csv and inspect the `head`

In [27]:
artwork.head()

Unnamed: 0,Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,Medium,...,ThumbnailURL,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Seat Height (cm),Duration (sec.)
0,"Ferdinandsbrücke Project, Vienna, Austria (Ele...",Otto Wagner,6210,"(Austrian, 1841–1918)",(Austrian),(1841),(1918),(Male),1896,Ink and cut-and-pasted painted pages on paper,...,http://www.moma.org/media/W1siZiIsIjU5NDA1Il0s...,,,,48.6,,,168.9,,
1,"City of Music, National Superior Conservatory ...",Christian de Portzamparc,7470,"(French, born 1944)",(French),(1944),(0),(Male),1987,Paint and colored pencil on print,...,http://www.moma.org/media/W1siZiIsIjk3Il0sWyJw...,,,,40.6401,,,29.8451,,
2,"Villa near Vienna Project, Outside Vienna, Aus...",Emil Hoppe,7605,"(Austrian, 1876–1957)",(Austrian),(1876),(1957),(Male),1903,"Graphite, pen, color pencil, ink, and gouache ...",...,http://www.moma.org/media/W1siZiIsIjk4Il0sWyJw...,,,,34.3,,,31.8,,
3,"The Manhattan Transcripts Project, New York, N...",Bernard Tschumi,7056,"(French and Swiss, born Switzerland 1944)",(),(1944),(0),(Male),1980,Photographic reproduction with colored synthet...,...,http://www.moma.org/media/W1siZiIsIjEyNCJdLFsi...,,,,50.8,,,50.8,,
4,"Villa, project, outside Vienna, Austria, Exter...",Emil Hoppe,7605,"(Austrian, 1876–1957)",(Austrian),(1876),(1957),(Male),1903,"Graphite, color pencil, ink, and gouache on tr...",...,http://www.moma.org/media/W1siZiIsIjEyNiJdLFsi...,,,,38.4,,,19.1,,


**Task:** Write a few sentences describing any problems

>There are a lot of NaN values in the table, for example in the circumference, depth, length, and weight columns. These are a potential problem.

#### Inspect the column names with the `columns` attribute

In [28]:
artwork.columns

Index(['Title', 'Artist', 'ConstituentID', 'ArtistBio', 'Nationality',
       'BeginDate', 'EndDate', 'Gender', 'Date', 'Medium', 'Dimensions',
       'CreditLine', 'AccessionNumber', 'Classification', 'Department',
       'DateAcquired', 'Cataloged', 'ObjectID', 'URL', 'ThumbnailURL',
       'Circumference (cm)', 'Depth (cm)', 'Diameter (cm)', 'Height (cm)',
       'Length (cm)', 'Weight (kg)', 'Width (cm)', 'Seat Height (cm)',
       'Duration (sec.)'],
      dtype='object')

**Question:** See any problems?

>These look like normal columns, and as far as I can see there is no problem here. I could be wrong; I need to look at them more closely to reach a conclusion.

#### Inspect the tail

In [29]:
artwork.tail()

Unnamed: 0,Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,Medium,...,ThumbnailURL,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Seat Height (cm),Duration (sec.)
139932,48 Jugendstil postcards,Various Artists,6104.0,(Nationality unknown),(Nationality unknown),(0),(0),(),c. 1898,,...,,,,,9.0,,,14.0,,
139933,48 Jugendstil postcards,Various Artists,6104.0,(Nationality unknown),(Nationality unknown),(0),(0),(),c. 1898,,...,,,,,9.0,,,14.0,,
139934,48 Jugendstil postcards,Various Artists,6104.0,(Nationality unknown),(Nationality unknown),(0),(0),(),c. 1898,,...,,,,,9.0,,,14.0,,
139935,48 Jugendstil postcards,Various Artists,6104.0,(Nationality unknown),(Nationality unknown),(0),(0),(),c. 1898,,...,,,,,9.0,,,14.0,,
139936,Untitled,,,,,,,,1993,"Chromogenic print, printed 2012",...,,,,,,,,,,


#### Check out the `shape`

In [35]:
artwork.shape

(139937, 29)

**Question:** What do these numbers mean?

>These numbers give the row and column counts. There are 139937 rows (records) and 29 columns (features).

#### Use `describe` to compute statistics

In [31]:
artwork.describe()

Unnamed: 0,ObjectID,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Seat Height (cm),Duration (sec.)
count,139937.0,10.0,15166.0,1464.0,122889.0,742.0,287.0,121968.0,0.0,1949.0
mean,102094.204656,44.86802,15.227029,23.226111,37.420501,89.687579,1299.039319,37.935016,,6081.282
std,90877.170308,28.631604,52.885909,44.902258,49.478379,329.428165,12079.491475,67.180817,,143441.8
min,2.0,9.9,0.0,0.635,0.0,0.0,0.09,0.0,,0.0
25%,37161.0,23.5,0.0,7.7788,17.8,17.1,5.8968,17.5,,120.0
50%,74919.0,36.0,0.0,13.6525,27.8,26.7,20.8655,25.4,,420.0
75%,144644.0,71.125,7.8,25.0,43.815088,79.7,80.9671,44.132588,,1500.0
max,436925.0,83.8,1808.483617,914.4,9140.0,8321.0566,185067.585957,9144.0,,6283065.0


#### Use `info` to look at types and totals

In [32]:
artwork.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 139937 entries, 0 to 139936
Data columns (total 29 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   Title               139898 non-null  object 
 1   Artist              138711 non-null  object 
 2   ConstituentID       138711 non-null  object 
 3   ArtistBio           134091 non-null  object 
 4   Nationality         138711 non-null  object 
 5   BeginDate           138711 non-null  object 
 6   EndDate             138711 non-null  object 
 7   Gender              138711 non-null  object 
 8   Date                137812 non-null  object 
 9   Medium              130213 non-null  object 
 10  Dimensions          130606 non-null  object 
 11  CreditLine          138084 non-null  object 
 12  AccessionNumber     139937 non-null  object 
 13  Classification      139936 non-null  object 
 14  Department          139937 non-null  object 
 15  DateAcquired        133251 non-nul

**Question:** What did you learn from the last two cells?

>The last two cells show which columns are mostly or entirely NaN. `describe` computes its statistics only from the non-missing values in each numeric column, and `info` shows how many non-null values each column has, which gives us a better understanding of our data.
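A complementary check, sketched here on a made-up frame, is to count the missing values per column directly with `isna().sum()`. The frame `gaps_demo` and its values are assumptions for illustration; the same two calls should work on `artwork`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a different number of gaps in each column.
gaps_demo = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Height (cm)": [48.6, np.nan, 34.3, 50.8],
    "Weight (kg)": [np.nan, np.nan, np.nan, 5.9],
})

# isna() marks each missing cell as True; sum() counts the True values per column.
missing_per_column = gaps_demo.isna().sum()

# notna().sum() gives the same non-null counts that info() reports.
non_null_per_column = gaps_demo.notna().sum()

print(missing_per_column)
print(non_null_per_column)
```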