<a href="https://colab.research.google.com/github/niklaust/Data_Science/blob/main/Python_for_Data_Analysis_notebook_of_niklaust.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Reference**
Wes McKinney. (2022). *Python for Data Analysis: Data Wrangling with pandas, NumPy, and Jupyter, Third Edition*. O'Reilly Media.

github:niklaust

start 20230327

<h1><center><b>Python for Data Analysis</b></center></h1>

# <center><b>Chapter 1. Preliminaries</b></center>

## **1.1 What will we learn about?**

These notes aim to give adequate preparation to enable you to move on to a more domain-specific resource.

We will learn about:
* manipulating
* processing
* cleaning
* crunching data

to become an effective data analyst.

**What kinds of Data?**

The primary focus is on **structured data**

* **Tabular** or **spreadsheet-like data** in which each column may be a different type (string, numeric, date, or otherwise). This includes most kinds of data commonly stored in relational databases or tab- or comma-delimited text files.
* **Multidimensional arrays** (matrices).
* Multiple tables of data interrelated by key columns (what would be primary or foreign keys for a **SQL** user).
* Evenly or unevenly spaced **time series**.
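
As a rough illustration of these kinds of structured data, here is a minimal sketch using pandas (introduced in section 1.3). The table names, column names, and values are made up for the example and do not come from the book.

```python
import pandas as pd

# Tabular data: each column has its own type (integer, float, date).
# All names and values here are hypothetical.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [10, 20, 10],
    "amount": [25.0, 40.5, 12.25],
    "placed": pd.to_datetime(["2023-01-05", "2023-01-07", "2023-02-01"]),
})

customers = pd.DataFrame({
    "customer_id": [10, 20],
    "name": ["Ada", "Grace"],
})

# Two tables interrelated by a key column, joined much like a SQL foreign key.
joined = orders.merge(customers, on="customer_id", how="left")

# An evenly spaced (daily) time series.
daily = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.date_range("2023-01-01", periods=3, freq="D"),
)

print(orders.dtypes)
print(joined[["order_id", "name", "amount"]])
print(daily)
```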

## **1.2 Why Python for Data Analysis?**

Python has developed a large and active scientific computing and data analysis community.

Python has become one of the most **important languages** for **data science**, **machine learning**, and general software development in academia and industry.


**Why Not Python?**

There are a number of uses for which Python may be less suitable.

As Python is an interpreted programming language, in general most Python **code will run substantially slower** than code written in a compiled language like Java or C++.

## **1.3 Essential Python Libraries**

### **NumPy**

NumPy, short for Numerical Python, has long been a cornerstone of **numerical computing** in Python. It provides the data structures, algorithms, and library glue needed for most scientific applications involving numerical data in Python. NumPy contains, among other things:

* A fast and efficient multidimensional array object **ndarray**
* Functions for performing element-wise **computations with arrays** or **mathematical operations between arrays**
* **Tools** for **reading** and **writing** array-based datasets to disk
* **Linear algebra operations**, Fourier transform, and random number generation
* A **mature C API** to enable Python **extensions** and native C or C++ code to access NumPy's data structures and computational facilities

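NumPy arrays are more efficient for storing and manipulating data than the other built-in Python data structures. Many numerical computing tools for Python either assume NumPy arrays as a primary data structure or else target interoperability with NumPy.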
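
A minimal sketch of the ndarray and element-wise operations described above. The array values are arbitrary and chosen only for illustration.

```python
import numpy as np

# A 2x3 ndarray; every element shares one dtype (float64 here).
data = np.array([[1.5, -0.1, 3.0],
                 [0.0, -3.0, 6.5]])

# Element-wise computation with a scalar and between two arrays.
scaled = data * 10
summed = data + data

# Basic linear algebra: a (2x3) @ (3x2) matrix product gives a 2x2 result.
gram = data @ data.T

print(data.shape, data.dtype)
print(scaled)
print(summed)
print(gram.shape)
```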

### **pandas**

pandas provides **high-level data structures** and **functions designed** to make **working with structured** or **tabular data** intuitive and flexible. 

The primary objects in pandas that we will focus on are the **DataFrame**, a tabular, column-oriented data structure with both row and column labels, and the **Series**, a one-dimensional labeled array object.

pandas blends the array-computing ideas of NumPy with the kinds of data manipulation capabilities found in spreadsheets and relational databases (such as SQL). It provides convenient indexing functionality to enable you to reshape, slice and dice, perform aggregations, and select subsets of data. Since **data manipulation**, **preparation**, and **cleaning** are such important skills in data analysis, pandas is one of the primary focuses of the book.
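
A short sketch of the two primary objects and of selecting and aggregating data. The state and population figures are illustrative values, not real data.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
population = pd.Series([1.5, 1.7, 3.6], index=["Ohio", "Texas", "Oregon"])

# A DataFrame is a table with row and column labels;
# its columns may hold different types.
frame = pd.DataFrame({
    "state": ["Ohio", "Ohio", "Nevada", "Nevada"],
    "year": [2000, 2001, 2001, 2002],
    "pop": [1.5, 1.7, 2.4, 2.9],
})

# Select a subset of rows with a boolean condition.
recent = frame[frame["year"] >= 2001]

# Aggregate: mean population per state.
mean_pop = frame.groupby("state")["pop"].mean()

print(population["Texas"])
print(recent)
print(mean_pop)
```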

### **matplotlib**

matplotlib is the most popular Python library for **producing plots** and **other two-dimensional data visualizations**. 

### **IPython**

IPython is a programming tool designed to facilitate **interactive computing** and **software development work**. It encourages an execute-explore workflow rather than the typical edit-compile-run workflow of many other programming languages. IPython also provides access to the operating system's shell and filesystem, which reduces the need to switch between a terminal window and a Python session.

### **SciPy**

SciPy is a collection of packages **addressing a number of foundational problems** in **scientific computing**. 

* `scipy.integrate` : Numerical integration routines and differential equation solvers
* `scipy.linalg` : Linear algebra routines and matrix decompositions extending beyond those provided in `numpy.linalg`
* `scipy.optimize` : Function optimizers (minimizers) and root finding algorithms
* `scipy.signal` : Signal processing tools
* `scipy.sparse` : Sparse matrices and sparse linear system solvers 
* `scipy.special` : Wrapper around SPECFUN, a FORTRAN library implementing many common mathematical functions, such as the `gamma` function
* `scipy.stats` : Standard continuous and discrete probability distributions (density functions, samplers, continuous distribution functions), various statistical tests, and more descriptive statistics

Together, NumPy and SciPy form a reasonably complete and mature computational foundation for many traditional scientific computing applications.
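
To give a feel for two of the submodules listed above, here is a small sketch using `scipy.integrate.quad` and `scipy.optimize.minimize_scalar`. The functions being integrated and minimized are toy examples chosen because their exact answers are known.

```python
import numpy as np
from scipy import integrate, optimize

# Numerically integrate sin(x) from 0 to pi; the exact answer is 2.
area, abs_err = integrate.quad(np.sin, 0.0, np.pi)

# Minimize the toy function (x - 3)^2; its minimum is at x = 3.
result = optimize.minimize_scalar(lambda x: (x - 3.0) ** 2)

print(area, abs_err)
print(result.x)
```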

### **scikit-learn**

scikit-learn has become the premier general-purpose **machine learning toolkit for Python programmers**. As of this writing, more than two thousand different individuals have contributed code to the project. It includes submodules for such models as:

* **Classification**: SVM, nearest neighbors, random forest, logistic regression, etc.
* **Regression**: Lasso, ridge regression, etc.
* **Clustering**: k-means, spectral clustering, etc.
* **Dimensionality reduction**: PCA, feature selection, matrix factorization, etc.
* **Model selection**: Grid search, cross-validation, metrics
* **Preprocessing**: Feature extraction, normalization
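
A minimal sketch that touches three of the areas above (preprocessing, classification, and model selection). The choice of the bundled iris dataset, the scaler, and `max_iter=1000` are assumptions made for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small dataset bundled with scikit-learn (no download needed).
X, y = load_iris(return_X_y=True)

# Preprocessing (normalization) followed by a classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Model selection: 5-fold cross-validated accuracy.
scores = cross_val_score(model, X, y, cv=5)

print(scores)
print(scores.mean())
```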

### **statsmodels**

statsmodels is a **statistical analysis package** that implements a number of **regression analysis models** popular in the R programming language.

Compared with scikit-learn, statsmodels contains algorithms for classical (primarily frequentist) **statistics** and **econometrics**. This includes such submodules as:

* **Regression models**: linear regression, generalized linear models, robust linear models, linear mixed effects models, etc.
* **Analysis of variance** (ANOVA)
* **Time series analysis**: AR, ARMA, ARIMA, VAR, and other models
* **Nonparametric methods**: Kernel density estimation, kernel regression
* **Visualization** of statistical model results

statsmodels is more focused on **statistical inference**, providing uncertainty estimates and p-values for parameters. scikit-learn, by contrast, is more prediction focused. 


Although readers may have different end goals for their work, the tasks required generally fall into a number of broad groups:

* **Interacting with the outside world** - Reading and writing with a variety of file formats and data stores
* **Preparation** - Cleaning, munging, combining, normalizing, reshaping, slicing and dicing, and transforming data for analysis
* **Transformation** - Applying mathematical and statistical operations to groups of datasets to derive new datasets (e.g., aggregating a large table by group variables)
* **Modeling and computation** - Connecting your data to statistical models, machine learning algorithms, or other computational tools
* **Presentation** - Creating interactive or static graphical visualizations or textual summaries
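
Several of these task groups can be sketched in a few lines of pandas. The CSV text, column names, and temperatures are invented for the example; an in-memory string stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical comma-delimited data with one missing measurement.
raw = io.StringIO(
    "city,temp_c\n"
    "Oslo,4.0\n"
    "Oslo,\n"
    "Rome,18.0\n"
    "Rome,22.0\n"
)

# Interacting with the outside world: read delimited text.
df = pd.read_csv(raw)

# Preparation: drop rows with a missing measurement.
clean = df.dropna(subset=["temp_c"])

# Transformation: aggregate by a group variable.
summary = clean.groupby("city")["temp_c"].mean()

# Presentation: a plain textual summary.
print(summary.to_string())
```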

# <center><b>Chapter 2. </b></center>