<a href="https://colab.research.google.com/github/niklaust/Data_Science/blob/main/Python_for_Data_Analysis_notebook_of_niklaust.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Reference**
Wes McKinney. (2022). *Python for Data Analysis: Data Wrangling with pandas, NumPy, and Jupyter, Third Edition*. O'Reilly.

github:niklaust

start 20230327

<h1><center><b>Data Analysis</b></center></h1>

# <center><b>Chapter 1. Preliminaries</b></center>

## **1.1 What will we learn?**

These notes aim to provide adequate preparation for moving on to more domain-specific resources.

We will learn about:
* manipulating
* processing
* cleaning
* crunching data

to become an effective data analyst.

**What kinds of Data?**

The primary focus is on **structured data**

* **Tabular** or **spreadsheet-like data** in which each column may be a different type (string, numeric, date, or otherwise). This includes most kinds of data commonly stored in relational databases or tab- or comma-delimited text files.
* **Multidimensional arrays** (matrices).
* Multiple tables of data interrelated by key columns (what would be primary or foreign keys for a **SQL** user).
* Evenly or unevenly spaced **time series**.

## **1.2 Why Python for Data Analysis?**

Python has developed a large and active scientific computing and data analysis community.

Python has become one of the most **important languages** for **data science**, **machine learning**, and general software development in academia and industry.


**Why Not Python?**

There are a number of uses for which Python may be less suitable.

As Python is an interpreted programming language, in general most Python **code will run substantially slower** than code written in a compiled language like Java or C++.

## **1.3 Essential Python Libraries**

### **NumPy**

NumPy, short for Numerical Python, has long been a cornerstone of **numerical computing** in Python. It provides the data structures, algorithms, and library glue needed for most scientific applications involving numerical data in Python. NumPy contains, among other things:

* A fast and efficient multidimensional array object **ndarray**
* Functions for performing element-wise **computations with arrays** or **mathematical operations between arrays**
* **Tools** for **reading** and **writing** array-based datasets to disk
* **Linear algebra operations**, Fourier transform, and random number generation
* A **mature C API** to enable Python **extensions** and native C or C++ code to access NumPy's data structures and computational facilities

For numerical data, NumPy arrays are more efficient for storing and manipulating data than the other built-in Python data structures. Many numerical computing tools for Python either assume NumPy arrays as a primary data structure or else target interoperability with NumPy.

### **pandas**

pandas provides **high-level data structures** and **functions designed** to make **working with structured** or **tabular data** intuitive and flexible. 

The primary objects in pandas that we will focus on are the **DataFrame**, a tabular, column-oriented data structure with both row and column labels, and the **Series**, a one-dimensional labeled array object.

pandas blends the array-computing ideas of NumPy with the kinds of data manipulation capabilities found in spreadsheets and relational databases (such as SQL). It provides convenient indexing functionality to enable you to reshape, slice and dice, perform aggregations, and select subsets of data. Since **data manipulation**, **preparation**, and **cleaning** are such important skills in data analysis, pandas is a primary focus of these notes.

### **matplotlib**

matplotlib is the most popular Python library for **producing plots** and **other two-dimensional data visualizations**. 

### **IPython**

IPython is a programming tool designed to facilitate **interactive computing** and **software development work**. The tool is unique in that it encourages an execute-explore workflow rather than the typical edit-compile-run workflow of many other programming languages. Additionally, IPython provides access to the operating system's shell and filesystem, which reduces the need for users to switch between a terminal window and a Python session.

### **SciPy**

SciPy is a collection of packages **addressing a number of foundational problems** in **scientific computing**. 

* `scipy.integrate` : Numerical integration routines and differential equation solvers
* `scipy.linalg` : Linear algebra routines and matrix decompositions extending beyond those provided in `numpy.linalg`
* `scipy.optimize` : Function optimizers (minimizers) and root finding algorithms
* `scipy.signal` : Signal processing tools
* `scipy.sparse` : Sparse matrices and sparse linear system solvers
* `scipy.special` : Wrapper around SPECFUN, a FORTRAN library implementing many common mathematical functions, such as the `gamma` function
* `scipy.stats` : Standard continuous and discrete probability distributions (density functions, samplers, continuous distribution functions), various statistical tests, and more descriptive statistics.

Together, NumPy and SciPy form a reasonably complete and mature computational foundation for many traditional scientific computing applications.

### **scikit-learn**

scikit-learn has become the premier general-purpose **machine learning toolkit for Python programmers**. As of this writing, more than two thousand different individuals have contributed code to the project. It includes submodules for such models as:

* **Classification**: SVM, nearest neighbors, random forest, logistic regression, etc.
* **Regression**: Lasso, ridge regression, etc.
* **Clustering**: k-means, spectral clustering, etc.
* **Dimensionality reduction**: PCA, feature selection, matrix factorization, etc.
* **Model selection**: Grid search, cross-validation, metrics
* **Preprocessing**: Feature extraction, normalization

### **statsmodels**

statsmodels is a **statistical analysis package** that implements a number of **regression analysis models** popular in the R programming language.

Compared with scikit-learn, statsmodels contains algorithms for classical (primarily frequentist) **statistics** and **econometrics**. This includes such submodules as:

* **Regression models**: linear regression, generalized linear models, robust linear models, linear mixed effects models, etc.
* **Analysis of variance** (ANOVA)
* **Time series analysis**: AR, ARMA, ARIMA, VAR, and other models
* **Nonparametric methods**: Kernel density estimation, kernel regression
* **Visualization of statistical model results**

statsmodels is more focused on **statistical inference**, providing uncertainty estimates and p-values for parameters. scikit-learn, by contrast, is more prediction focused. 


Although people have different end goals for their work, the tasks required generally fall into a number of broad groups:

* **Interacting with the outside world** - Reading and writing with a variety of file formats and data stores
* **Preparation** - Cleaning, munging, combining, normalizing, reshaping, slicing and dicing, and transforming data for analysis
* **Transformation** - Applying mathematical and statistical operations to groups of datasets to derive new datasets (e.g., aggregating a large table by group variables)
* **Modeling and computation** - Connecting your data to statistical models, machine learning algorithms, or other computational tools
* **Presentation** - Creating interactive or static graphical visualizations or textual summaries

# <center><b>Chapter 2. Python Language Basics, IPython and Jupyter Notebooks </b></center>

These notes are an introductory text on working with data in Python.

The focus is mostly on:
* Table-based analytics
* Data preparation tools for working with datasets

Sometimes we do some wrangling to arrange messy data into a more nicely tabular (or structured) form.

## **2.1 The Python Interpreter**

Python is an **interpreted language**. The Python interpreter **runs a program by executing one statement at a time**.

Many people who do data analysis or scientific computing make use of **IPython**, an enhanced Python interpreter, or **Jupyter notebooks**, web-based code notebooks originally created within the IPython project.

## **2.2 IPython Basics**

IPython provides facilities to execute arbitrary blocks of code (via a somewhat glorified copy-and-paste approach) and whole Python scripts. 

In [6]:
a = 5
a

5

In [5]:
import numpy as np

data = [np.random.standard_normal() for i in range(7)]
data

[-0.0897850222318582,
 1.7311433792888753,
 1.4131745614923839,
 -0.151770027838262,
 1.3853420237700667,
 -0.3713195143206432,
 0.35114147658826644]

**Jupyter Notebook**



One of the major components of the Jupyter project is the **notebook**, a type of **interactive document for code**, text (including Markdown), data visualizations, and other output.

**Introspection**

Using a question mark (?) before or after a variable will display some general information about the object:

In [10]:
b = [1, 2, 3]

In [12]:
b?

In [13]:
print?

In [14]:
def add_numbers(a, b):
  """
  Add two numbers together
  
  Returns 
  --------
  the_sum : type of arguments
  """
  return a + b

In [15]:
add_numbers?

`?` has a final usage, which is for searching the IPython namespace in a manner similar to the standard Unix or Windows command line. A number of characters combined with the wildcard (\*) will show all names matching the wildcard expression. For example, we could get a list of all functions in the top-level NumPy namespace containing `load`:

In [16]:
import numpy as np 

In [17]:
np.*load*?

## **2.3 Python Language Basics**

**Language Semantics**

The Python language design is distinguished by its **emphasis on readability**, **simplicity**, and **explicitness**. Some people go so far as to liken it to "executable pseudocode."

Python uses **whitespace** (tabs or spaces) to structure code instead of using braces as in many other languages.

A **colon** denotes the **start of an indented code block**, after which all of the code must be indented by the same amount until the end of the block.

**Semicolons** can be used to **separate multiple statements** on a single line. However, putting multiple statements on one line is generally discouraged in Python, as it can make code less readable.


**Comments**: any text preceded by the hash mark (pound sign) `#` is ignored by the Python interpreter. This is often used to add comments to code.
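
As a minimal sketch of these rules (the list `values` and the other names are made up for illustration), the colon opens each block and the indentation level alone decides which statements belong to it:

```python
# Hypothetical data, used only to illustrate block structure.
values = [3, -1, 4, -5]

positives = []
negatives = []

for v in values:              # the colon opens the loop body
    if v >= 0:                # nested block: indented one level deeper
        positives.append(v)
    else:
        negatives.append(v)

# Back at the left margin, so this line runs once, after the loop ends.
print(positives, negatives)   # [3, 4] [-1, -5]

# Semicolons are legal but discouraged because they hurt readability.
a = 1; b = 2
```

Moving the `print` call one level to the right would make it part of the loop body, which is why consistent indentation matters.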

**Variables and argument passing**

When **assigning a variable** (or name) in Python, you are **creating a reference to the object** shown on the righthand side of the equals sign. In practical terms, consider a list of integers:


In [18]:
a = [1, 2, 3]

In [19]:
b = a 

In [20]:
b

[1, 2, 3]

In some languages, the assignment of `b` would cause the data `[1, 2, 3]` to be **copied**.

In **Python**, `a` and `b` actually **now refer to the same object**, the original list `[1, 2, 3]`. Appending to `a` is therefore visible through `b`:

In [21]:
a.append(4)

In [22]:
b

[1, 2, 3, 4]

When you pass objects as arguments to a **function**, new **local variables are created** referencing the original objects without any copying. If you bind a new object to a variable inside a function, that will not overwrite a variable of the same name in the "scope" outside of the function (the "parent scope"). It is therefore possible to alter the internals of a mutable argument. 

In [23]:
def append_element(some_list, element):
  some_list.append(element)

In [25]:
data = [1, 2, 3]
append_element(data, 4)

In [26]:
data

[1, 2, 3, 4]
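
The difference between rebinding a name and mutating an object can be sketched as follows. The function names `rebind` and `mutate` are hypothetical and exist only for this illustration:

```python
def rebind(some_list):
    # Binds the local name to a brand-new list; the caller's list is untouched.
    some_list = [0, 0, 0]
    return some_list


def mutate(some_list):
    # Changes the one object that both the caller and the function refer to.
    some_list.append(99)


data = [1, 2, 3]

rebind(data)
print(data)   # [1, 2, 3]

mutate(data)
print(data)   # [1, 2, 3, 99]
```

Only `mutate` changes what the caller sees, because it alters the shared object instead of pointing a local name at a new one.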

**Dynamic references, strong types**

Variables in Python have **no inherent type** associated with them; a variable can refer to a different type of object simply by doing an assignment. 

In [27]:
a = 5

In [28]:
type(a)

int

In [29]:
a = "foo"

In [30]:
type(a)

str

Python is a **strongly typed language**, which means that **every object has a specific type (or class),** and implicit conversions will occur only in certain permitted circumstances.

In [32]:
a = 4.5
b = 2

In [33]:
# String formatting, to be visited later
print(f"a is {type(a)}, b is {type(b)}")

a is <class 'float'>, b is <class 'int'>


In [34]:
a / b

2.25
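
Dividing a `float` by an `int` is one of the permitted implicit conversions. As a small sketch of a case that is not permitted, adding a string to an integer raises a `TypeError` instead of guessing at a conversion:

```python
try:
    "5" + 5                      # str + int: no implicit conversion happens
except TypeError as exc:
    message = str(exc)
    print("TypeError:", message)

# An explicit conversion states the intent clearly.
total = int("5") + 5
print(total)   # 10
```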

**Knowing the type of an object is important**, and it's useful to be able to write functions that can handle many different kinds of input. You can check that an object is an instance of a particular type using the `isinstance` function:

In [35]:
a = 5

isinstance(a, int)

True

`isinstance` can accept a tuple of types if you want to check that an object's type is among those present in the tuple:

In [40]:
a = 5
b = 4.5

In [37]:
isinstance(a, (int, float))

True

In [42]:
isinstance(b, (int, float))

True

**Attributes and methods**

Objects in Python typically have both: 

* **attributes** (other Python objects stored "inside" the object) 
* **methods** (functions associated with an object that can have access to the object's internal data). 

Both are accessed via the syntax `obj.attribute_name`. Attributes and methods can also be accessed by name via the `getattr` function:

In [43]:
a = "foo"

In [44]:
getattr(a, "split")

<function str.split(sep=None, maxsplit=-1)>

**Duck typing**

Often you may **not care about the type of an object** but rather only whether it has **certain methods or behavior**. This is sometimes called *duck typing*. For example, the function below checks whether an object is iterable by calling `iter` on it:

In [45]:
def isiterable(obj):
  try:
    iter(obj)
    return True
  except TypeError: # not iterable
    return False

In [46]:
isiterable("a string")

True

In [47]:
isiterable([1, 2, 3])

True

In [48]:
isiterable(5)

False
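
One place this kind of check is useful is in functions that should accept any kind of sequence. The sketch below repeats `isiterable` so that it runs on its own; `as_list` is a hypothetical helper name chosen for this illustration:

```python
def isiterable(obj):
    try:
        iter(obj)
        return True
    except TypeError:  # not iterable
        return False


def as_list(x):
    # Convert any non-list iterable to a list; leave everything else alone.
    if not isinstance(x, list) and isiterable(x):
        x = list(x)
    return x


print(as_list((1, 2, 3)))   # [1, 2, 3]
print(as_list("abc"))       # ['a', 'b', 'c']
print(as_list(5))           # 5
```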

**Imports**

In Python, a module is simply a file with the `.py` extension containing Python code. Suppose we had the following module:

In [53]:
# some_module.py
PI = 3.14159

def f(x):
  return x + 2

def g(a, b):
  return a + b

If we wanted to access the variables and functions defined in `some_module.py` from another file in the same directory, we could do:

In [54]:
import some_module

result = some_module.f(5)
pi = some_module.PI

In [55]:
from some_module import g, PI

result = g(5, PI)

In [56]:
import some_module as sm
from some_module import PI as pi, g as gf 

r1 = sm.f(pi)
r2 = gf(6, pi)

**Binary operators and comparisons**

In [57]:
5 - 7

-2

In [58]:
12 + 21.5

33.5

In [59]:
5 <= 2

False

<table>
  <tr>
    <th colspan="2"><h4><b>Binary Operators</b></h4></th>
  </tr>
  <tr>
    <th>Operation</th>
    <th>Description</th>
  </tr>
  <tr>
    <td>a + b</td>
    <td>Add a and b
</td>
  </tr>
  <tr>
    <td>a - b</td>
    <td>Subtract b from a</td>
  </tr>
  <tr>
    <td>a * b</td>
    <td>Multiply a by b</td>
  </tr>
  <tr>
    <td>a / b</td>
    <td>Divide a by b</td>
  </tr>
  <tr>
    <td>a // b</td>
    <td>Floor-divide a by b, dropping any fractional remainder</td>
  </tr>
  <tr>
    <td>a ** b </td>
    <td>Raise a to the b power</td>
  </tr>
  <tr>
    <td>a & b</td>
    <td>True if both a and b are True; for integers, take the bitwise AND</td>
  </tr>
  <tr>
    <td>a | b</td>
    <td>True if either a or b is True; for integers, take the bitwise OR</td>
  </tr>
  <tr>
    <td>a ^ b</td>
    <td>For Booleans, True if a or b is True, but not both; for integers, take the bitwise EXCLUSIVE-OR</td>
  </tr>
  <tr>
    <td>a == b</td>
    <td>True if a equals b</td>
  </tr>
  <tr>
    <td>a != b</td>
    <td>True if a is not equal to b</td>
  </tr>
  <tr>
    <td>a < b, a <= b</td>
    <td>True if a is less than (less than or equal to) b</td>
  </tr>
  <tr>
    <td>a > b, a >= b</td>
    <td>True if a is greater than (greater than or equal to) b</td>
  </tr>
  <tr>
    <td>a is b </td>
    <td>True if a and b reference the same Python object</td>
  </tr>
  <tr>
    <td>a is not b</td>
    <td>True if a and b reference different Python objects</td>
  </tr>
</table>
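
A few of the less familiar operators from the table, shown as a short sketch with arbitrarily chosen values:

```python
true_div = 7 / 2       # 3.5  (true division always returns a float)
floor_div = 7 // 2     # 3
neg_floor = -7 // 2    # -4   (floor division rounds toward negative infinity)
power = 2 ** 10        # 1024

bit_and = 6 & 3        # 2    (0b110 & 0b011 == 0b010)
bit_or = 6 | 3         # 7    (0b110 | 0b011 == 0b111)
bit_xor = 6 ^ 3        # 5    (0b110 ^ 0b011 == 0b101)
bool_xor = True ^ True # False

print(true_div, floor_div, neg_floor, power)
print(bit_and, bit_or, bit_xor, bool_xor)
```

Note that for negative operands `//` does not simply drop the fractional part: `-7 // 2` is `-4`, not `-3`.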

To check whether two variables refer to the same object, use the `is` keyword. Use `is not` to check that two objects are not the same:

In [60]:
a = [1, 2, 3]
b = a 
c = list(a)

In [61]:
a is b

True

In [62]:
a is not c

True

Since the `list` function always creates a new Python list (i.e., a copy), we can be sure that `c` is distinct from `a`. Comparing with `is` is not the same as using the `==` operator, which compares values rather than identity:

In [63]:
a == c

True

A common use of `is` and `is not` is to check if a variable is `None`, since there is only one instance of `None`:

In [64]:
a = None

In [65]:
a is None

True

**Mutable and immutable objects**

* **Mutable** objects can have the values they contain **modified**; examples are lists, dictionaries, NumPy arrays, and most user-defined types (classes).

* **Immutable** objects have internal data that **cannot be changed**; examples are strings and tuples.

In [66]:
a_list = ["fool", 2, [4, 5]]
a_list[2] = (3, 4)
a_list

['fool', 2, (3, 4)]

In [67]:
a_tuple = (3, 5, (4, 5))

try:
  a_tuple[1] = "four"
except TypeError:  # tuples do not support item assignment
  print("error")

error
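
Immutability applies to the tuple's own slots, not to the objects stored in them. As a small sketch (the name `nested` is arbitrary), a list held inside a tuple can still be modified in place:

```python
nested = ("foo", [1, 2], True)

nested[1].append(3)   # allowed: this mutates the list, not the tuple
print(nested)         # ('foo', [1, 2, 3], True)

try:
    nested[1] = [9]   # not allowed: this would rebind a tuple slot
except TypeError:
    print("cannot assign to a tuple element")
```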


**Scalar Types**

Python has a small set of built-in types for handling numerical data, strings, Boolean (True or False) values, and dates and times. These "single value" types are sometimes called scalar types, and we refer to them as scalars.

<table>
  <tr>
    <th colspan="2"><h4><b>Standard Python scalar types</b></h4></th>
  </tr>
  <tr>
    <th>Type</th>
    <th>Description</th>
  </tr>
  <tr>
    <td>None</td>
    <td>The Python “null” value (only one instance of the None object exists)</td>
  </tr>
  <tr>
    <td>str</td>
    <td>String type; holds Unicode strings
</td>
  </tr>
  <tr>
    <td>bytes</td>
    <td>Raw binary data
</td>
  </tr>
  <tr>
    <td>float</td>
    <td>Double-precision floating-point number (note there is no separate double type)
</td>
  </tr>
  <tr>
    <td>bool</td>
    <td>A Boolean True or False value</td>
  </tr>
  <tr>
    <td>int</td>
    <td>Arbitrary precision integer</td>
  </tr>
</table>

**Numeric types**

The primary Python types for numbers are `int` and `float`. 

* `int` can store arbitrarily large numbers.
* `float` represents a double-precision floating-point number.

In [68]:
ival = 123456
ival ** 6

3540570200530940541182574329856

In [69]:
fval = 7.25
fval2 = 6.78e-5

In [70]:
3/2

1.5

In [72]:
9//2

4

**Strings**

You can write string literals using either single quotes `'` or double quotes `"` (double quotes are generally favored):

In [73]:
a = 'one way of writing a string'
b = "another way"

For multiline strings with line breaks, you can use triple quotes, `'''` or `"""`:

In [74]:
c = """
This is a longer string that 
spans multiple lines
"""

In [75]:
c.count("\n")

3

Python strings are immutable; you cannot modify a string in place:

In [76]:
a = "this is a string"

try:
  a[10] = "f"
except TypeError:  # strings do not support item assignment
  print("error")

error


In [77]:
b = a.replace("string", "longer string")
b 

'this is a longer string'

In [78]:
a 

'this is a string'

In [80]:
a = 5.6

s = str(a)
print(s, type(s))

5.6 <class 'str'>


In [81]:
s = "python"

list(s)

['p', 'y', 't', 'h', 'o', 'n']

In [82]:
s[:3]

'pyt'

In [83]:
s = "12\\34"

print(s)

12\34


In [84]:
# r: raw string
s = r"this\has\no\special\characters"

s

'this\\has\\no\\special\\characters'

In [85]:
# concatenates two strings
a = "this is the first half"
b = "and this is the second half"

a + b 

'this is the first halfand this is the second half'

In [89]:
template = "{0:.2f} {1:s} are worth US${2:d}"

# {0:.2f} means to format the first argument as a floating-point number with two decimal places
# {1:s} means to format the second argument as a string
# {2:d} means to format the third argument as an exact integer

template.format(88.46, "Argentine Pesos", 1)

'88.46 Argentine Pesos are worth US$1'

In [90]:
amount = 10
rate = 88.46
currency = "Pesos"

result = f"{amount} {currency} is worth US${amount / rate}"
result

'10 Pesos is worth US$0.11304544426859599'

In [91]:
f"{amount} {currency} is worth US${amount/ rate:.2f}"

'10 Pesos is worth US$0.11'

**Bytes and Unicode**

In [92]:
val = "español"

val 

'español'

In [94]:
val_utf8 = val.encode("utf-8")

val_utf8

b'espa\xc3\xb1ol'

In [95]:
type(val_utf8)

bytes

In [96]:
val_utf8.decode("utf-8")

'español'

While it is now preferable to use UTF-8 for any encoding, for historical reasons you may encounter data in any number of different encodings:

In [97]:
val.encode("latin1")

b'espa\xf1ol'

In [98]:
val.encode("utf-16")

b'\xff\xfee\x00s\x00p\x00a\x00\xf1\x00o\x00l\x00'

In [99]:
val.encode("utf-16le")

b'e\x00s\x00p\x00a\x00\xf1\x00o\x00l\x00'
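
The encoding matters when decoding, too. The sketch below reuses the same word (written with a `\u00f1` escape so the code does not depend on the file's own encoding) and shows that decoding with the wrong codec can either produce garbled text silently or raise an error:

```python
val = "espa\u00f1ol"              # the string 'español'

utf8 = val.encode("utf-8")
latin1 = val.encode("latin1")

# The character ñ takes two bytes in UTF-8 but one byte in Latin-1.
print(len(val), len(utf8), len(latin1))   # 7 8 7

# Wrong codec, no error: every byte is a valid Latin-1 character.
wrong = utf8.decode("latin1")
print(wrong)                              # espaÃ±ol

# Wrong codec, error: the Latin-1 bytes are not valid UTF-8.
try:
    latin1.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")
```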

**Booleans**

The two Boolean values in Python are written as `True` and `False`. Comparisons and other conditional expressions evaluate to either `True` or `False`. Boolean values are combined with the `and` and `or` keywords:

In [100]:
True and True

True

In [101]:
False or True

True

In [102]:
int(False)

0

In [103]:
int(True)

1

In [104]:
a = True 
b = False

In [105]:
not a 

False

In [106]:
not b

True

**Type casting**

The `str`, `bool`, `int` and `float` types are also functions that can be used to cast values to those types:


In [107]:
s = "3.141459"

fval = float(s)
type(fval)

float

In [108]:
int(fval)

3

In [109]:
bool(fval)

True

In [110]:
bool(0)

False
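
A few more casts, as a sketch with arbitrarily chosen inputs. Note in particular that `int` truncates a float toward zero, and that it refuses a string that looks like a float:

```python
print(int("42"))       # 42
print(float("1e3"))    # 1000.0
print(int(-3.9))       # -3 (truncates toward zero)

# Empty containers and zero are False; a non-empty string is True,
# even when its content is "0".
print(bool(""), bool("0"), bool([]), bool(0.0))   # False True False False

try:
    int("3.14")        # a float-looking string is not a valid int literal
except ValueError:
    print("cannot parse '3.14' as int")

# Converting in two explicit steps works.
value = int(float("3.14"))
print(value)           # 3
```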

**None**

`None` is the Python null value type:

In [111]:
a = None

a is None

True

In [112]:
b = 5

b is not None

True

In [113]:
def add_and_maybe_multiply(a, b, c=None):
  result = a + b

  if c is not None:
    result = result * c 
  
  return result
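
A short usage sketch for this function (repeated here so the block runs on its own) shows why the check is written as `c is not None` rather than a plain truth test:

```python
def add_and_maybe_multiply(a, b, c=None):
    result = a + b

    if c is not None:
        result = result * c

    return result


print(add_and_maybe_multiply(2, 3))      # 5
print(add_and_maybe_multiply(2, 3, 4))   # 20
print(add_and_maybe_multiply(2, 3, 0))   # 0
```

With `if c:` instead, the last call would skip the multiplication and return `5`, because `0` is falsy even though it is a legitimate argument.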

**Dates and times**

The built-in Python `datetime` module provides `datetime`, `date`, and `time` types. The `datetime` type combines the information stored in `date` and `time` and is the most commonly used:

In [117]:
from datetime import datetime, date, time

dt = datetime(2023, 3, 8, 19, 59, 30) # year, month, day, hour, minute, second

In [116]:
dt.month

3

In [118]:
dt.second

30

In [120]:
dt.date()

datetime.date(2023, 3, 8)

In [121]:
dt.time()

datetime.time(19, 59, 30)

In [123]:
dt.strftime("%Y-%m-%d %H:%M")

'2023-03-08 19:59'

In [124]:
dt_hour = dt.replace(minute=0, second=0)

dt_hour

datetime.datetime(2023, 3, 8, 19, 0)

In [125]:
dt

datetime.datetime(2023, 3, 8, 19, 59, 30)

In [126]:
dt2 = datetime(2023, 4, 15, 19, 59, 30)

delta = dt2 - dt
delta

datetime.timedelta(days=38)

In [127]:
type(delta)

datetime.timedelta

In [128]:
dt 

datetime.datetime(2023, 3, 8, 19, 59, 30)

In [129]:
dt + delta

datetime.datetime(2023, 4, 15, 19, 59, 30)
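
Going the other way, strings can be parsed into `datetime` objects with `datetime.strptime`, which takes the same format codes as `strftime`. A minimal sketch, with a timestamp chosen to match the example above:

```python
from datetime import datetime, timedelta

parsed = datetime.strptime("2023-03-08 19:59", "%Y-%m-%d %H:%M")
print(parsed)                               # 2023-03-08 19:59:00

# A timedelta can also be built directly and added to a datetime.
later = parsed + timedelta(days=1, hours=2)
print(later.strftime("%Y-%m-%d %H:%M"))     # 2023-03-09 21:59

# The difference between two datetimes, expressed in seconds.
print((later - parsed).total_seconds())     # 93600.0
```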

**Control Flow**

Python has several built-in keywords for **conditional logic**, **loops**, and **other standard control flow concepts** found in other programming languages.

**if, elif, and else**

In [130]:
x = -5 

if x < 0:
  print("It's negative")

It's negative


In [131]:
if x < 0:
  print("It's negative")
elif x == 0:
  print("Equal to zero")
elif 0 < x < 5:
  print("Positive but smaller than 5")
else:
  print("Positive and larger than or equal to 5")

It's negative


In [132]:
a = 5; b = 7
c = 8; d = 4

if a < b or c > d:
  print("Made it")

Made it
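
Compound conditions are evaluated left to right and stop as soon as the result is known, and comparisons can be chained. The sketch below illustrates both; `noisy` and `calls` are hypothetical names used only to make the short-circuit visible:

```python
x = 3

# A chained comparison is equivalent to joining the parts with `and`.
print(0 < x < 5)         # True
print(0 < x and x < 5)   # True
print(4 > 3 > 2 > 1)     # True

calls = []


def noisy():
    # Records that it was called, then returns True.
    calls.append("called")
    return True


# `or` short-circuits: the left side is already True, so noisy() never runs.
result = True or noisy()
print(result, calls)     # True []
```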


**for loops**

`for` loops are for iterating over a collection (like a list or tuple) or an iterator.

In [136]:
sequence = [1, 2, None, 4, None, 5]
total = 0

for value in sequence:
  if value is None:
    continue
  total += value

print(total)

12


In [137]:
sequence = [1, 2, 0, 4, 6, 5, 2, 1]
total_until_5 = 0

for value in sequence:
  if value == 5:
    break 
  total_until_5 += value

print(total_until_5)

13


In [138]:
for i in range(4):
  for j in range(4):
    if j > i:
      break
    print((i, j))

(0, 0)
(1, 0)
(1, 1)
(2, 0)
(2, 1)
(2, 2)
(3, 0)
(3, 1)
(3, 2)
(3, 3)
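
When the elements of the collection are themselves sequences, such as tuples, they can be unpacked into separate variables directly in the `for` statement. A small sketch with made-up data:

```python
# Hypothetical (key, value) pairs for illustration.
pairs = [("a", 1), ("b", 2), ("c", 3)]

totals = {}
for key, value in pairs:    # each tuple is unpacked into two names
    totals[key] = value * 10

print(totals)   # {'a': 10, 'b': 20, 'c': 30}
```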


**while loops**

A `while` loop specifies a condition and a block of code that is to be executed until the condition evaluates to `False` or the loop is explicitly ended with `break`:

In [142]:
x = 256
total = 0

while x > 0:
  if total > 500:
    break
  total += x
  x = x // 2
  print(total, x)


256 128
384 64
448 32
480 16
496 8
504 4


**pass**

`pass` is the **"no-op" (or "do nothing") statement** in Python. It can be used in blocks where no action is to be taken (or as a placeholder for code not yet implemented); it is required only because Python uses whitespace to delimit blocks:


In [143]:
if x < 0:
  print("negative!")
elif x == 0:
  # TODO: put something smart here
  pass
else:
  print("positive!")

positive!


**range**

The `range` function generates a sequence of evenly spaced integers:

In [144]:
range(10)

range(0, 10)

In [145]:
list(range(10))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [146]:
list(range(0, 20, 2))      # start, stop, step

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

In [147]:
list(range(5, 0, -1))

[5, 4, 3, 2, 1]

In [148]:
seq = [1, 2, 3, 4]

for i in range(len(seq)):
  print(f"element {i}: {seq[i]}")

element 0: 1
element 1: 2
element 2: 3
element 3: 4
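
Iterating over `range(len(seq))` works, but when both the index and the value are needed, the built-in `enumerate` function is a common alternative. This is a sketch of the same loop written that way (collecting the lines in a list is only done here so the result is easy to inspect):

```python
seq = [1, 2, 3, 4]

lines = []
for i, value in enumerate(seq):   # yields (index, element) pairs
    lines.append(f"element {i}: {value}")

print("\n".join(lines))
```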


In [149]:
total = 0

for i in range(100_000):
  # % is the modulo operator
  if i % 3 == 0 or i % 5 == 0:   # add every number divisible by 3 or 5
    total += i

print(total)

2333316668
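
The result can be cross-checked without a loop. The multiples of $k$ below a limit form an arithmetic series, $k(1 + 2 + \dots + n) = k \cdot n(n+1)/2$ with $n = \lfloor (\text{limit} - 1)/k \rfloor$, and numbers divisible by both 3 and 5 (that is, by 15) must be subtracted once so they are not counted twice. A sketch of that check, where `sum_multiples` is a hypothetical helper name:

```python
def sum_multiples(k, limit):
    # Sum of all positive multiples of k strictly below limit.
    n = (limit - 1) // k
    return k * n * (n + 1) // 2


limit = 100_000

# Inclusion-exclusion: multiples of 15 appear in both of the first two sums.
closed_form = (
    sum_multiples(3, limit)
    + sum_multiples(5, limit)
    - sum_multiples(15, limit)
)

# The same total computed by iterating over the range.
looped = sum(i for i in range(limit) if i % 3 == 0 or i % 5 == 0)

print(closed_form, looped)   # 2333316668 2333316668
```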
