# pandas
pandas is a fast, powerful, flexible and easy to use open source **data analysis and manipulation tool**, built on top of the Python programming language.

pandas makes working with “relational” or “labeled” data both easy and intuitive.

pandas blends the high-performance, array-computing ideas of NumPy with the flexible data manipulation capabilities of spreadsheets and relational databases (such as SQL databases).

### What Is Data Science?
There’s a joke that says a data scientist is someone who knows more
statistics than a computer scientist and more computer science than a
statistician.

### NumPy
NumPy, short for Numerical Python, has long been a cornerstone of numerical computing in Python. 
It provides the data structures, algorithms, and library glue needed for most scientific applications involving numerical data in Python. 
NumPy contains, among other things:
- A fast and efficient multidimensional array object, `ndarray` (the "nd" stands for "n-dimensional")
- Functions for performing element-wise computations with arrays or mathematical operations between arrays
- Tools for reading and writing array-based datasets to disk
- Linear algebra operations, Fourier transform, and random number generation
- A mature C API to enable Python extensions and native C or C++ code to access NumPy’s data structures and computational facilities

One of the reasons NumPy is so important for numerical computations in Python is that it is designed for **efficiency on large arrays of data**. There are a number of reasons for this:
- NumPy internally stores data in a contiguous block of memory, independent of other built-in Python objects. NumPy’s library of algorithms written in the C language can operate on this memory without any type checking or other overhead. NumPy arrays also use much less memory than built-in Python sequences.
- NumPy operations perform complex computations on entire arrays without the need for Python `for` loops.
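
As a minimal sketch of the second point, the snippet below doubles every element of an array in one vectorized expression and compares the result with an explicit Python loop. The names `values`, `doubled`, and `doubled_list` are made up for this illustration.

```python
import numpy as np

# Hypothetical data for illustration: the integers 0..99999
values = np.arange(100_000)

# Vectorized: one expression, the loop runs inside NumPy's compiled code
doubled = values * 2

# The same computation written as an explicit Python-level loop
doubled_list = [v * 2 for v in range(100_000)]

print(doubled[:5])       # first five results
print(values.nbytes)     # bytes used by the array's contiguous data buffer
```

Both versions produce the same numbers; the vectorized form simply avoids running the loop in the Python interpreter.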


## Install
pandas can be installed via pip from PyPI.

`pip install pandas`

### Import Conventions
The Python community has adopted a number of naming conventions for commonly used modules:

In [78]:
import pandas as pd

### DataFrame
When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you. pandas will help you to explore, clean, and process your data. 

In pandas, a data table is called a **DataFrame**.

We can think of a DataFrame as a list of rows, where each row is a list:
```
[
    [90, 'Turkey', 'Ankara'], 
    [44, 'United Kingdom', 'London'], 
    [1, 'United States', 'Washington'], 
    [81, 'Japan', 'Tokyo'], 
    [86, 'China', 'Beijing']
]
```

Another model is a dictionary, where keys are the "column names" and values are the lists:
```
# singular: one record, one value per key
person = {
    "name": "value",
    "lastname": "value",
}

# plural: many records, one list of values per key (column)
people = {
    "name": ["value1", "value2"],
    "lastname": ["value1", "value2"],
}
```
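
To connect the two models, here is a small sketch that builds the same table once from a list of rows and once from a dictionary of columns. The variable names `from_rows` and `from_columns` are illustrative only; the values are the first two rows of the country table above.

```python
import pandas as pd

# Model 1: a list of rows, with the column names passed separately
from_rows = pd.DataFrame(
    [[90, 'Turkey', 'Ankara'], [44, 'United Kingdom', 'London']],
    columns=['code', 'name', 'capital'],
)

# Model 2: a dictionary that maps each column name to a list of values
from_columns = pd.DataFrame({
    'code': [90, 44],
    'name': ['Turkey', 'United Kingdom'],
    'capital': ['Ankara', 'London'],
})

# Compare the two tables cell by cell
print(from_rows.values.tolist() == from_columns.values.tolist())
```

Which model is more convenient usually depends on how the data arrives: record by record, or column by column.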



In [79]:
# Manually build a DataFrame
# Option 1: pass the data as a list of lists, where each inner list is one ROW
country_codes = pd.DataFrame(
    [[90, 'Turkey', 'Ankara'], [44, 'United Kingdom', 'London'], [1, 'United States', 'Washington'], [81, 'Japan', 'Tokyo'], [86, 'China', 'Beijing']],
    index=['TR', 'UK', 'US', 'JP', 'CN'],
    columns=['code', 'name', 'capital']
)
print(country_codes)

    code            name     capital
TR    90          Turkey      Ankara
UK    44  United Kingdom      London
US     1   United States  Washington
JP    81           Japan       Tokyo
CN    86           China     Beijing


### Option 2 - Create a DataFrame from a Python dict

From a Python dictionary:
```
{
    'LastName':['Booker', 'Grey', 'Johnson', 'Jenkins', 'Smith'],
    'Email': ['bo@example.com', 'gr@example.com', 'jo@example.com', 'je@example.com', 'sm@example.com'],
    'Username': ['booker12', 'grey07', 'johnson81', 'jenkins46', 'smith79']
}
```

To a pandas DataFrame:
```
  LastName           Email   Username
0   Booker  bo@example.com   booker12
1     Grey  gr@example.com     grey07
2  Johnson  jo@example.com  johnson81
3  Jenkins  je@example.com  jenkins46
4    Smith  sm@example.com    smith79
```

In [80]:
# Option 2 - Create a DataFrame from a Python dict
# A separate list represents each COLUMN
# person = {
#     "name"      : "value",
#     "lastname"  : "value"
#     }
# people = {
#     "name"      : ["value1","value2"],
#     "lastname"  : ["value1","value2"]
#     }
lastnames = ['Booker', 'Grey', 'Johnson', 'Jenkins', 'Smith']
emails = ['bo@example.com', 'gr@example.com', 'jo@example.com', 'je@example.com', 'sm@example.com']
usernames = ['booker12', 'grey07', 'johnson81', 'jenkins46', 'smith79']

# A dictionary where keys are the "column names" and values are the lists:
users_dict = {'LastName': lastnames, 'Email': emails, 'Username': usernames}

# Create a DataFrame from a Python dict:
# df = pd.DataFrame({'col1':[], 'col2':[]})
df_users = pd.DataFrame(users_dict)

print(df_users)


  LastName           Email   Username
0   Booker  bo@example.com   booker12
1     Grey  gr@example.com     grey07
2  Johnson  jo@example.com  johnson81
3  Jenkins  je@example.com  jenkins46
4    Smith  sm@example.com    smith79


## Which file formats?
pandas integrates with many file formats and data sources out of the box (CSV, Excel, SQL, JSON, Parquet, …). Data is imported from each of these sources by a function with the prefix `read_*`. Similarly, the `to_*` methods are used to store data.
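
A minimal round-trip sketch of the `to_*` / `read_*` pairing, using CSV. To stay self-contained it writes to an in-memory `io.StringIO` buffer instead of a file on disk; the frame `df_out` and its two rows are assumptions made for this example.

```python
import io
import pandas as pd

# Hypothetical two-row table used only for this illustration
df_out = pd.DataFrame({
    'Username': ['booker12', 'grey07'],
    'Id': [9012, 2070],
})

# Store: to_csv writes the table as CSV text (index=False skips the row labels)
buffer = io.StringIO()
df_out.to_csv(buffer, index=False)

# Load: rewind the buffer and parse the CSV text back into a DataFrame
buffer.seek(0)
df_in = pd.read_csv(buffer)

print(df_in)
```

With a real file, the buffer would be replaced by a path, for example `df_out.to_csv("users.csv", index=False)` followed by `pd.read_csv("users.csv")`.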

In [81]:
df = pd.read_csv("data/users.csv")
print(df)

       Username                   Email    Id First name Last name Active
0      blu3FisH      rachel@example.com  9012     Rachel    Booker    Yes
1     GreenFr0g   laura@yourcompany.com  2070      Laura      Grey    Yes
2      blackd0g   craig@yourcompany.com  4081      Craig   Johnson     No
3     jenkins46        mary@example.com  9346       Mary   Jenkins    Yes
4       smith79       jamie@example.com  5079      Jamie     Smith    Yes
5        john00               js@co.com   303       John     Smith     No
6   BlackTurkey    jhalprin@example.com   304        Jim   Halprin    Yes
7      BlueHawk      tjones@example.com   305     Teresa     Jones    Yes
8     GreenTree    tomjones@example.com   306      Tommy     Jones    Yes
9    OrangeFish  greggjones@example.com   307      Gregg     Jones    Yes
10      RedBoat   dthompson@example.com   308     Daniel  Thompson    Yes


In [82]:
brics = pd.read_csv("data/brics.csv", index_col=0)


### Head and tail
To view a small sample of a Series or DataFrame object, use the `head()` and `tail()` methods. The default number of rows to display is five, but you may pass a custom number.

In [83]:
df.head()

    Username                  Email    Id First name Last name Active
0   blu3FisH     rachel@example.com  9012     Rachel    Booker    Yes
1  GreenFr0g  laura@yourcompany.com  2070      Laura      Grey    Yes
2   blackd0g  craig@yourcompany.com  4081      Craig   Johnson     No
3  jenkins46       mary@example.com  9346       Mary   Jenkins    Yes
4    smith79      jamie@example.com  5079      Jamie     Smith    Yes


In [84]:
df.tail(3)

      Username                   Email   Id First name Last name Active
8    GreenTree    tomjones@example.com  306      Tommy     Jones    Yes
9   OrangeFish  greggjones@example.com  307      Gregg     Jones    Yes
10     RedBoat   dthompson@example.com  308     Daniel  Thompson    Yes


### Attributes and underlying data



In [85]:
brics.describe()

           area  population
count       5.0         5.0
mean      7.944     601.176
std    6.200557  645.261454
min       1.221       52.98
25%       3.286       143.5
50%       8.516       200.4
75%       9.597      1252.0
max        17.1      1357.0


In [86]:
brics.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, BR to SA
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     5 non-null      object 
 1   capital     5 non-null      object 
 2   area        5 non-null      float64
 3   population  5 non-null      float64
dtypes: float64(2), object(2)
memory usage: 200.0+ bytes


`shape` gives the axis dimensions of the object as a tuple `(rows, columns)`, consistent with NumPy's `ndarray`.

In [87]:
brics.shape

(5, 4)

In [88]:
brics.columns

Index(['country', 'capital', 'area', 'population'], dtype='object')

## Summary statistics
Basic statistics (mean, median, min, max, counts…) are easy to calculate.

`describe()` shows a quick statistical summary of the numeric columns in your data.
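
As a small sketch, the individual statistics are also available as methods on a column. The frame `brics_demo` below is a hypothetical stand-in that reuses the `area` and `population` values of the BRICS table shown in this document, so it runs without the CSV file.

```python
import pandas as pd

# Stand-in for the BRICS table (numeric columns only), built by hand
brics_demo = pd.DataFrame(
    {
        'area': [8.516, 17.1, 3.286, 9.597, 1.221],
        'population': [200.4, 143.5, 1252.0, 1357.0, 52.98],
    },
    index=['BR', 'RU', 'IN', 'CH', 'SA'],
)

print(brics_demo['area'].mean())          # average of one column
print(brics_demo['population'].median())  # middle value of one column
print(brics_demo['area'].max())           # largest value of one column
print(brics_demo.describe())              # all summary rows at once
```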

In [134]:
city_pop = pd.read_csv("data/population_tr.csv", sep=';')
city_pop.describe()

             2000        2001        2002        2003        2004        2005        2006        2007        2008        2009  ...        2012        2013        2014        2015        2016        2017        2018        2019        2020        2021
count        82.0        82.0        82.0        82.0        82.0        82.0        82.0        82.0        82.0        82.0  ...        82.0        82.0        82.0        82.0        82.0        82.0        82.0        82.0        82.0        82.0
mean    1578768.0   1600077.0   1619557.0   1638713.0   1658786.0   1679525.0   1700731.0   1721616.0   1744320.0   1769788.0  ...   1844570.0   1869948.0   1895022.0   1920513.0   1946704.0   1970988.0   2000095.0   2028171.0   2039375.0   2065373.0
std     7182483.0   7280900.0   7371050.0   7459777.0   7552731.0   7648774.0   7746996.0   7843821.0   7946958.0   8064122.0  ...   8413460.0   8531728.0   8647138.0   8765726.0   8884413.0   8996369.0   9125385.0   9258787.0   9307607.0   9429832.0
min       75221.0     75517.0     75709.0     75868.0     76050.0     76246.0     76444.0     76609.0     75675.0     74710.0  ...     75797.0     75620.0     80607.0     78550.0     82193.0     80417.0     82274.0     84660.0     81910.0     83645.0
25%      271939.5    274563.8    276847.2    278026.0    279218.8    279768.8    280003.0    280102.0    279918.5    281631.8  ...    282107.5    283987.0    285154.2    287849.0    290063.2    288831.8    291243.2    289810.0    289932.8    295251.5
50%      483484.5    487549.5    490201.5    492244.0    490303.5    488386.5    486403.5    484127.5    484911.0    492681.5  ...    511833.0    517204.0    519505.0    519260.5    525019.0    529419.5    538070.0    537479.0    539655.0    549800.5
75%      864218.2    875172.8    885107.8    894849.0    905071.8    915641.8    926446.8    937059.5    953584.0    965956.8  ...    992545.0   1006584.0   1026159.0   1038490.0   1052617.0   1065313.0   1080791.0   1097082.0   1109579.0   1128873.0
max    64729500.0  65603160.0  66401850.0  67187250.0  68010220.0  68860540.0  69729970.0  70586260.0  71517100.0  72561310.0  ...  75627380.0  76667860.0  77695900.0  78741050.0  79814870.0  80810520.0  82003880.0  83155000.0  83614360.0  84680270.0


## Series  
A Series is a one-dimensional sequence of labeled data: a one-dimensional ndarray with axis labels, like a single column of a table.

Note that the visual display of a Series is just **plain text**, as opposed to the nicely styled table for DataFrames. You will also see the data type or dtype of the Series.
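
A minimal sketch of building a Series by hand. The name `capitals` is made up for this example; the values and labels are taken from the BRICS table used elsewhere in this document.

```python
import pandas as pd

# Values plus one label per value; `name` plays the role of a column name
capitals = pd.Series(
    ['Brasilia', 'Moscow', 'New Delhi'],
    index=['BR', 'RU', 'IN'],
    name='capital',
)

print(capitals)                 # plain-text display, ending with name and dtype
print(capitals['RU'])           # look up a single value by its label
print(capitals.index.tolist())  # the axis labels
```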

### Selecting data with just the indexing operator

- Selecting a single column as a Series 
<br>To select a single column of data, simply put the name of the column in-between the brackets.
```
brics['capital']
brics.capital      # attribute access, works when the column name is a valid Python identifier
```

- Selecting multiple columns as a DataFrame: `brics[['country', 'capital']]`
<br>Pass a list of column names to the indexing operator, which is why the brackets are doubled: `[[...]]`

In [118]:
brics['capital']    # Series

code
BR     Brasilia
RU       Moscow
IN    New Delhi
CH      Beijing
SA     Pretoria
Name: capital, dtype: object

In [121]:
type(brics['capital'])  # pandas.core.series.Series


pandas.core.series.Series

In [123]:
brics[['country', 'capital']]

           country    capital
code
BR          Brazil   Brasilia
RU          Russia     Moscow
IN           India  New Delhi
CH           China    Beijing
SA    South Africa   Pretoria


In [125]:
type(brics[['country', 'capital']])  # pandas.core.frame.DataFrame


pandas.core.frame.DataFrame

### Select Row(s)
The `.loc` indexer can select subsets of rows or columns. 
Most importantly, it only selects data by the **LABEL** of the rows and columns.

- Select a single row as a Series with .loc: `.loc['RU']`
- Select multiple rows as a DataFrame with .loc: `.loc[['IN', 'RU']]`
- Use slice notation to select a range of rows: `.loc['IN':'SA']`
- Select a single row and a single column. This returns a scalar value: `brics.loc['RU', 'capital']`
- Select 2 rows and 2 columns: `brics.loc[['IN', 'RU'],['country', 'capital']]`
- Select all of the rows and 2 columns: `brics.loc[:, ['country', 'capital']]`
- Select a column (all of the rows): `brics.loc[:, ['capital']]`

In [96]:
brics.loc['RU']

country       Russia
capital       Moscow
area            17.1
population     143.5
Name: RU, dtype: object

In [99]:
brics.loc[['RU', 'IN']]


      country    capital   area  population
code
RU     Russia     Moscow   17.1       143.5
IN      India  New Delhi  3.286      1252.0


In [117]:
brics.loc['IN':'SA']    # Note that .loc includes the last value with slice notation!

           country    capital   area  population
code
IN           India  New Delhi  3.286      1252.0
CH           China    Beijing  9.597      1357.0
SA    South Africa   Pretoria  1.221       52.98


In [101]:
brics.loc['RU', 'capital']

'Moscow'

In [100]:
brics.loc[['IN', 'RU'],['country', 'capital']]

      country    capital
code
IN      India  New Delhi
RU     Russia     Moscow


In [103]:
brics.loc[:, ['country', 'capital']]

           country    capital
code
BR          Brazil   Brasilia
RU          Russia     Moscow
IN           India  New Delhi
CH           China    Beijing
SA    South Africa   Pretoria


In [107]:
brics.loc[:, 'capital'] # Series


code
BR     Brasilia
RU       Moscow
IN    New Delhi
CH      Beijing
SA     Pretoria
Name: capital, dtype: object

In [108]:
brics.loc[:, ['capital']]   # DataFrame


        capital
code
BR     Brasilia
RU       Moscow
IN    New Delhi
CH      Beijing
SA     Pretoria


### Slicing
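Note that a slice is written directly inside `.loc[...]`, without an extra pair of square brackets: `.loc[:, 'First name':'Active']`. Only a list of labels needs the inner brackets.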

In [None]:
df.loc[:, 'Email']

0         rachel@example.com
1      laura@yourcompany.com
2      craig@yourcompany.com
3           mary@example.com
4          jamie@example.com
5                  js@co.com
6       jhalprin@example.com
7         tjones@example.com
8       tomjones@example.com
9     greggjones@example.com
10     dthompson@example.com
Name: Email, dtype: object

In [None]:
df.loc[:, ['Id','Last name']]

     Id Last name
0  9012    Booker
1  2070      Grey
2  4081   Johnson
3  9346   Jenkins
4  5079     Smith
5   303     Smith
6   304   Halprin
7   305     Jones
8   306     Jones
9   307     Jones


In [None]:
df.loc[:, 'First name':'Active']

  First name Last name Active
0     Rachel    Booker    Yes
1      Laura      Grey    Yes
2      Craig   Johnson     No
3       Mary   Jenkins    Yes
4      Jamie     Smith    Yes
5       John     Smith     No
6        Jim   Halprin    Yes
7     Teresa     Jones    Yes
8      Tommy     Jones    Yes
9      Gregg     Jones    Yes


### Selecting subsets with .iloc
The `.iloc` indexer is very similar to `.loc`, but it selects only by integer position.

Positions are 0-based, so the first row is at position 0:
```
df.iloc[0]
```
- Selecting a single row with .iloc: `.iloc[0]` 
- Selecting multiple rows with .iloc: `.iloc[[0, 2, 4]]`
- Use slice notation to select a range of rows: `.iloc[3:5]`
- Selecting 2 rows and 2 columns: `.iloc[[0,4], [0, 2]]`
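
One difference worth checking when switching between the two indexers is how a slice treats its end point: a `.loc` label slice includes the last label, while an `.iloc` position slice stops before the end position, like ordinary Python slicing. The sketch below uses a hypothetical one-column frame named `demo` with the same row labels as the BRICS table.

```python
import pandas as pd

# Hypothetical frame for illustration, indexed by the same labels as brics
demo = pd.DataFrame(
    {'country': ['Brazil', 'Russia', 'India', 'China', 'South Africa']},
    index=['BR', 'RU', 'IN', 'CH', 'SA'],
)

by_label = demo.loc['IN':'SA']   # label slice: the end label 'SA' is included
by_position = demo.iloc[2:4]     # position slice: position 4 is excluded

print(by_label.index.tolist())     # ['IN', 'CH', 'SA']
print(by_position.index.tolist())  # ['IN', 'CH']
```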

In [110]:
brics.iloc[0]   # Series

country         Brazil
capital       Brasilia
area             8.516
population       200.4
Name: BR, dtype: object

In [111]:
brics.iloc[[0, 2, 4]]   # pass a list of positions; brics.iloc[0, 2, 4] raises an error


           country    capital   area  population
code
BR          Brazil   Brasilia  8.516       200.4
IN           India  New Delhi  3.286      1252.0
SA    South Africa   Pretoria  1.221       52.98


In [113]:
brics.iloc[2:4]


      country    capital   area  population
code
IN      India  New Delhi  3.286      1252.0
CH      China    Beijing  9.597      1357.0


In [114]:
brics.iloc[[0, 4], [0, 2]]

           country   area
code
BR          Brazil  8.516
SA    South Africa  1.221


### Filtering

In [92]:
df['Active'] == "Yes"
# df.Active == "Yes"


0      True
1      True
2     False
3      True
4      True
5     False
6      True
7      True
8      True
9      True
10     True
Name: Active, dtype: bool
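
The boolean Series above can be used as a mask: passing it to the indexing operator keeps only the rows where the value is `True`. This is a minimal sketch on a hypothetical three-row frame named `users`, with values borrowed from the users table in this document.

```python
import pandas as pd

# Hypothetical subset of the users table, built by hand for illustration
users = pd.DataFrame({
    'Username': ['blu3FisH', 'blackd0g', 'smith79'],
    'Active': ['Yes', 'No', 'Yes'],
})

# Step 1: the comparison yields one True/False value per row
mask = users['Active'] == 'Yes'

# Step 2: indexing with the mask keeps only the rows marked True
active_users = users[mask]

print(active_users)
```

The same mask also works with `.loc`, for example `users.loc[mask, 'Username']`, which filters rows and picks a column in one step.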