### NumPy 
- NumPy is an open-source library, predominantly used for working with arrays.
- NumPy enables most of the operations required in linear algebra.
- It is computationally efficient because it operates on typed arrays rather than general-purpose Python lists.
- It is commonly used alongside SciPy and Matplotlib.
- Together, these libraries are a widely used open-source alternative to MATLAB, which was the traditional tool for technical computing.
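
To make the array-versus-list point concrete, here is a minimal sketch contrasting an element-by-element list operation with the equivalent array operation. The variable names are illustrative and no timing figures are claimed.

```python
import numpy as np

# A plain Python list needs an explicit loop (or comprehension) for element-wise math
values_list = [1, 2, 3, 4, 5]
doubled_list = [v * 2 for v in values_list]

# A NumPy array applies the operation to every element in one vectorized expression
values_arr = np.array(values_list)
doubled_arr = values_arr * 2

print(doubled_list)
print(doubled_arr)
```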

### Pandas
- Pandas is an open-source library built on top of NumPy. Thus, several NumPy structures are used or replicated in Pandas.
- The name 'Pandas' is derived from 'panel data', a term in econometrics.
- Pandas facilitates working with tabular data, time series data, matrix data, etc.
- Pandas is best suited for:
  - Importing data from CSV, JSON files, etc.
  - Handling missing values in the dataset.
  - Wrangling and manipulating data.
  - Merging multiple datasets.
  - Exporting results as CSV, JSON files, etc.
  - Handling time-series data.
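
The tasks listed above can be sketched in a few lines. This is only an illustration: the CSV text, column names and fill strategy are made up for the example, and an in-memory buffer stands in for a real file.

```python
import io
import pandas as pd

# Import: read CSV text from an in-memory buffer (a file path works the same way)
csv_text = "id,score\n1,10.0\n2,\n3,30.0\n"
scores = pd.read_csv(io.StringIO(csv_text))

# Handle missing values: fill the empty score with the column mean
scores["score"] = scores["score"].fillna(scores["score"].mean())

# Merge: join a second (hypothetical) dataset on the shared "id" column
names = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
merged = scores.merge(names, on="id")

# Export: write the result back out as CSV text
out = io.StringIO()
merged.to_csv(out, index=False)
print(out.getvalue())
```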

### SciPy
- The name is short for Scientific Python.
- SciPy is an open-source library built on top of NumPy.
- It's tailor-made for scientific and engineering applications.
- It offers many features that are useful when dealing with data, for example constants, optimizers and graph routines.
- It has sub-modules for computationally intensive areas like:
   - Optimization
   - Linear algebra
   - Signal and image processing, etc.
- SciPy uses the multi-dimensional array provided by the NumPy library as its basic data structure.
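
As a rough illustration of two of these sub-modules, the sketch below minimizes a simple function with `scipy.optimize` and solves a small linear system with `scipy.linalg`. The function and the system are made up for the example.

```python
import numpy as np
from scipy import linalg, optimize

# Optimization: find the minimum of f(x) = (x - 2)^2 + 1
result = optimize.minimize_scalar(lambda x: (x - 2) ** 2 + 1)

# Linear algebra: solve the system 2x + y = 5, x + 3y = 10
coeffs = np.array([[2.0, 1.0], [1.0, 3.0]])
rhs = np.array([5.0, 10.0])
solution = linalg.solve(coeffs, rhs)

print(result.x)
print(solution)
```

Both calls take and return NumPy arrays, which is what "built on top of NumPy" means in practice.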

### Statsmodels
- Statsmodels is a Python library for estimating statistical models and performing statistical tests.
- It covers descriptive statistics, statistical tests, plotting functions, etc. extensively.
- It is capable of handling in-depth statistical projects.
- Some of the important features provided by statsmodels are:
   - Linear regression
   - ANOVA
   - Time series analysis
   - Statistical tests
   - Graphics

## Fundamentals of NumPy
- NumPy stands for Numerical Python.
- It is an open-source library used primarily for mathematical operations in science and engineering applications.
- NumPy is convenient for performing operations on multi-dimensional arrays and matrices.
- As the core of NumPy is written in C, mathematical operations on arrays execute quickly.
- It is an efficient multidimensional container of generic data.
- As we can define arbitrary data types using NumPy, it aids in integrating with a wide variety of databases.
- The NumPy array type is called ndarray. Some important attributes of ndarray are:
   - **ndarray.ndim** - number of axes (dimensions) of the array.
   - **ndarray.shape** - tuple of integers that gives the size of the array in each dimension. Example: for an m*n matrix, the shape is (m,n). The length of the shape tuple is the number of axes, ndim.
   - **ndarray.size** - the total number of elements in the array, which is equal to the product of the elements of the shape.
   - **ndarray.dtype** - the object that describes the type of elements in the array. Ex: numpy.int16, numpy.int32, numpy.float64.
   - **ndarray.itemsize** - the size in bytes of each item in an array.
   - **ndarray.data** - the buffer containing the actual elements of the array. It is rarely used directly; use indexing to access the elements.
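
The relationships between these attributes can be checked directly. The sketch below fixes the dtype explicitly, because the default integer type (and therefore `itemsize`) can differ between platforms; the array `m` is just an example.

```python
import numpy as np

# Fix the dtype explicitly so itemsize does not depend on the platform default
m = np.zeros((3, 4), dtype=np.int16)

print(m.ndim)      # number of axes
print(m.shape)     # size along each axis
print(m.size)      # total number of elements
print(m.itemsize)  # bytes per element
print(m.nbytes)    # total bytes, size * itemsize
```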

### NumPy array Shapes and axes

In [2]:
#import the numpy package
import numpy as np

In [3]:
#0 dimensional array
np.array(24)

array(24)

In [4]:
#1 dimensional array
np.array([1,2,3,4])

array([1, 2, 3, 4])

In [5]:
# 2 dimensional array
np.array([[1,2,3],[1,2,3]])

array([[1, 2, 3],
       [1, 2, 3]])

In [6]:
# 3 dimensional array
numpy_array=np.array([[[1,2,3],[1,2,3]],[[1,2,3],[1,2,3]]])
numpy_array

array([[[1, 2, 3],
        [1, 2, 3]],

       [[1, 2, 3],
        [1, 2, 3]]])

In [7]:
print(numpy_array.shape)

(2, 2, 3)


In [8]:
numpy_array.ndim

3

**Create NumPy arrays with a minimum number of dimensions**

In [9]:
np.array([1,2,3,4,5],ndmin=4)

array([[[[1, 2, 3, 4, 5]]]])

**Change the dimensions using the reshape method**

In [11]:
numpy_arr=np.array([x for x in range(1,10)])
numpy_arr

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [13]:
reshaped_array=numpy_arr.reshape(3,3)
reshaped_array

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [14]:
reshaped_array.flatten()

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [15]:
#change row to column
reshaped_array.transpose()

array([[1, 4, 7],
       [2, 5, 8],
       [3, 6, 9]])

In [12]:
numpy_arr=np.array([x for x in range(1,9)])
numpy_arr
numpy_arr.reshape(2,2,2)

array([[[1, 2],
        [3, 4]],

       [[5, 6],
        [7, 8]]])
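
Two details of reshaping are easy to miss: a dimension can be left for NumPy to infer with `-1`, and `flatten()` returns a copy whereas `reshape()` returns a view where possible. A minimal sketch, with illustrative variable names:

```python
import numpy as np

base = np.arange(1, 9)

# -1 asks NumPy to infer that dimension from the array size
inferred = base.reshape(2, -1)

# flatten() always returns a copy, so editing it leaves the source untouched
flat_copy = inferred.flatten()
flat_copy[0] = 100

# reshape() returns a view where possible (as it is here), sharing memory with the source
view = base.reshape(4, 2)
view[0, 0] = -1

print(inferred.shape)
print(base[0])
```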

**Arithmetic Operations using NumPy arrays**

In [13]:
A=np.array([[1,2,3],[2,3,4]])
B=np.array([[3,4,5],[5,6,7]])
#Addition
print(np.add(A,B))
#Subtraction
print(np.subtract(A,B))
#Multiplication
print(np.multiply(A,B))
#Division
print(np.divide(A,B))
#A to the power B
print(np.power(A,B))

[[ 4  6  8]
 [ 7  9 11]]
[[-2 -2 -2]
 [-3 -3 -3]]
[[ 3  8 15]
 [10 18 28]]
[[0.33333333 0.5        0.6       ]
 [0.4        0.5        0.57142857]]
[[    1    16   243]
 [   32   729 16384]]
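
The same element-wise operations are available through the ordinary arithmetic operators, and arrays of compatible shapes are broadcast against each other. The sketch below reuses `A` and `B` from the cell above; the scalar and the row vector are made up for the example.

```python
import numpy as np

A = np.array([[1, 2, 3], [2, 3, 4]])
B = np.array([[3, 4, 5], [5, 6, 7]])

# The arithmetic operators give the same element-wise results as np.add, np.power, ...
same_sum = np.array_equal(A + B, np.add(A, B))
same_power = np.array_equal(A ** B, np.power(A, B))

# Broadcasting: a scalar or a row of length 3 is stretched across both rows of A
scaled = A * 10
shifted = A + np.array([100, 200, 300])

print(same_sum, same_power)
print(scaled)
print(shifted)
```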


**Conditional Statements using NumPy arrays**

In [14]:
num_arr=np.array([x for x in range(1,10)])
np.where(num_arr%2==0,'Even','odd')

array(['odd', 'Even', 'odd', 'Even', 'odd', 'Even', 'odd', 'Even', 'odd'],
      dtype='<U4')

**numpy.select**
- **Syntax:** numpy.select(condlist, choicelist, default=0). The `default` value is used wherever all conditions evaluate to False.


In [15]:
condition_list=[num_arr>5,num_arr<5]
#if x>5 calculate square of x else if x<5 calculate cube of x
choice_list=[num_arr**2,num_arr**3]
np.select(condition_list,choice_list,default=num_arr)

array([ 1,  8, 27, 64,  5, 36, 49, 64, 81])
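
It may help to see that `np.select` behaves like a chain of nested `np.where` calls, and that the first matching condition wins when conditions overlap. A small sketch; the overlapping conditions are made up for the example.

```python
import numpy as np

num_arr = np.arange(1, 10)

# np.select picks, for each element, the choice of the first condition that is True
selected = np.select([num_arr > 5, num_arr < 5], [num_arr**2, num_arr**3], default=num_arr)

# The same logic written as nested np.where calls
nested = np.where(num_arr > 5, num_arr**2, np.where(num_arr < 5, num_arr**3, num_arr))

# When conditions overlap, the earlier one wins: 8 is > 5 and > 7, but only the first choice is used
overlap = np.select([num_arr > 5, num_arr > 7], [1, 2], default=0)

print(selected)
print(overlap)
```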

**Common Mathematical and Statistical functions in NumPy**

In [19]:
import numpy as np
arr=np.array([[2,6,4],[7,3,5],[5,7,8]])
arr

array([[2, 6, 4],
       [7, 3, 5],
       [5, 7, 8]])

In [20]:
#returns the minimum value in the array
print(np.min(arr))

#returns the minimum of each column (axis=0) and of each row (axis=1)
print(np.min(arr,axis=0))
print(np.min(arr,axis=1))

#returns the maximum value in the array
print(np.amax(arr))

#returns the maximum of each column (axis=0) and of each row (axis=1)
print(np.max(arr,axis=0))
print(np.max(arr,axis=1))

2
[2 3 4]
[2 3 5]
8
[7 7 8]
[6 7 8]


In [22]:
arr=np.array([[2,6,4],[7,3,5],[5,7,8]])
arr

array([[2, 6, 4],
       [7, 3, 5],
       [5, 7, 8]])

In [23]:
#Calculate the median
print(np.median(arr))

#Calculate the mean (average of the data)
print(np.mean(arr))
#Mean and median are not the same here

#Calculate the standard deviation
print(np.std(arr))

#Calculate the variance
print(np.var(arr))

#NumPy has no mode function (np.mode raises AttributeError), so count the values with np.unique instead
values,counts=np.unique(arr,return_counts=True)
#5 and 7 both occur twice; argmax returns the first (smallest) of the tied values
print(values[np.argmax(counts)])

#Calculate the percentile of the array (np.percentile also accepts a plain list and converts it to an array)
print(np.percentile(arr,50))

5.0
5.222222222222222
1.8724777273725242
3.506172839506173
5
5.0
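
One point worth knowing about `np.std` and `np.var`: by default they divide by N (the population formulas). Passing `ddof=1` divides by N - 1 and gives the sample estimate. A minimal sketch using the same array:

```python
import numpy as np

arr = np.array([[2, 6, 4], [7, 3, 5], [5, 7, 8]])

# Default (ddof=0): divide by N, the population variance
pop_var = np.var(arr)

# ddof=1: divide by N - 1, the sample variance
sample_var = np.var(arr, ddof=1)

# The two differ by the factor N / (N - 1)
n = arr.size
print(pop_var, sample_var)
print(np.isclose(sample_var, pop_var * n / (n - 1)))
```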

In [4]:
#import the numpy package
import numpy as np
#string concatenation
str1=np.array(['a for ','b for '])
str2=np.array(['apple','ball'])
concat=np.char.add(str1,str2)
concat

array(['a for apple', 'b for ball'], dtype='<U11')

In [19]:
deg=np.array([0,30,45,60,90])
#Calculate the sine of the degrees
print(np.sin(deg*np.pi/180))

#Calculate the cosine of the degrees
print(np.cos(deg*np.pi/180))

#Calculate the tan of the degrees
print(np.tan(deg*np.pi/180))

#Calculate the inverse of sin, cos and tan
#arcsin and arccos are only defined for inputs in [-1, 1], so the radian values above 1 (60 and 90 degrees) give nan with a RuntimeWarning
print(np.arcsin(deg*np.pi/180))
print(np.arccos(deg*np.pi/180))
print(np.arctan(deg*np.pi/180))

[0.         0.5        0.70710678 0.8660254  1.        ]
[1.00000000e+00 8.66025404e-01 7.07106781e-01 5.00000000e-01
 6.12323400e-17]
[0.00000000e+00 5.77350269e-01 1.00000000e+00 1.73205081e+00
 1.63312394e+16]
[0.         0.55106958 0.90333911        nan        nan]
[1.57079633 1.01972674 0.66745722        nan        nan]
[0.         0.48234791 0.66577375 0.80844879 1.00388482]
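
The inverse functions expect a ratio rather than an angle, which is why passing radians to them produces `nan` for larger inputs. The sketch below shows a round trip from degrees to a sine value and back, using NumPy's `np.deg2rad` and `np.rad2deg` helpers instead of the manual `pi/180` factor.

```python
import numpy as np

deg = np.array([0, 30, 45, 60, 90])

# np.deg2rad is equivalent to multiplying by pi / 180
rad = np.deg2rad(deg)

# Inverse functions take a ratio in [-1, 1] and return an angle in radians
recovered_deg = np.rad2deg(np.arcsin(np.sin(rad)))

print(np.allclose(rad, deg * np.pi / 180))
print(recovered_deg)
```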


In [20]:
num=np.array([1.0,0.8,-2.2,-9.87])
#rounds down to the nearest whole number (towards negative infinity)
print(np.floor(num))
#rounds up to the nearest whole number (towards positive infinity)
print(np.ceil(num))

[  1.   0.  -3. -10.]
[ 1.  1. -2. -9.]


**Indexing and Slicing with NumPy**

In [26]:
array_1d = np.array([1,2,3,4,5,6])
array_2d = np.array([[1,2,3],[4,5,6]])
#3d has depth(layer),rows and columns
array_3d = np.array([[[1,2,3],[3,4,5]],
                     [[6,7,8],[9,10,11]]])
print(array_1d)
print(array_2d)
print(array_3d)

[1 2 3 4 5 6]
[[1 2 3]
 [4 5 6]]
[[[ 1  2  3]
  [ 3  4  5]]

 [[ 6  7  8]
  [ 9 10 11]]]


**Indexing**

In [36]:
print(array_1d[0]) #returns the 1st element
print(array_1d[-3]) #returns the 4th element (negative indices count from the end; this works in arrays of any dimension)
print(array_2d[1][2]) #returns the element in the 2nd row and 3rd column (index starts from 0)
print(array_3d[0][1][2]) #returns the element in the 1st layer, 2nd row, 3rd column

1
4
6
5
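
Chained brackets work, but NumPy also accepts a single comma-separated index, which reaches the same element without building intermediate arrays. A short sketch using the same 3D array:

```python
import numpy as np

array_3d = np.array([[[1, 2, 3], [3, 4, 5]],
                     [[6, 7, 8], [9, 10, 11]]])

# Chained indexing creates an intermediate array at each step
chained = array_3d[0][1][2]

# A single comma-separated index reaches the same element in one step
direct = array_3d[0, 1, 2]

# Negative indices work per axis: last layer, last row, last column
last = array_3d[-1, -1, -1]

print(chained, direct, last)
```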


**Slicing**

In [4]:
array_1d = np.array([1,2,3,4,5,6])
array_2d = np.array([[1,2,3],[4,5,6]])
#3d has depth(layer),rows and columns
array_3d = np.array([[[1,2,3],[3,4,5]],
                     [[6,7,8],[9,10,11]]])

print(array_1d[1:]) #sliced from index 1 (2nd element) to the last
print(array_1d[2:4]) #sliced from index 2 (3rd element) up to, but not including, index 4
print(array_1d[-3:-1]) #sliced from the 3rd-last element up to, but not including, the last element
print(array_2d[0:,1]) #2nd column of every row
print(array_2d[0:,1:3]) #2nd and 3rd columns of every row

print(array_2d[0:3,1:3]) #same result: a stop index beyond the array length is clipped to the last row
# print(array_3d[0][1][2]) #returns the element in the 1st layer, 2nd row, 3rd column

[2 3 4 5 6]
[3 4]
[4 5]
[2 5]
[[2 3]
 [5 6]]
[[2 3]
 [5 6]]
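
Slices also accept a step, and a basic slice is a view on the original data rather than a copy. The sketch below illustrates both; the variable names are only for this example.

```python
import numpy as np

array_1d = np.array([1, 2, 3, 4, 5, 6])

# start:stop:step, with stop excluded
every_other = array_1d[::2]
reversed_arr = array_1d[::-1]

# A basic slice is a view: writing through it changes the original array
window = array_1d[1:3]
window[0] = 99

# Use .copy() when an independent array is needed
safe = array_1d[3:5].copy()
safe[0] = -1

print(every_other)
print(array_1d)
```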


**Attributes of NumPy arrays**

In [3]:
import numpy as np
array_2d = np.array([[1,2,3],[4,5,6]])
#Returns the number of dimensions
array_2d.ndim

2

In [6]:
#returns the number of elements
array_2d.size

6

In [4]:
array_2d.shape

(2, 3)

In [7]:
array_2d.itemsize

4

In [8]:
array_2d.dtype

dtype('int32')

In [9]:
array_2d.data

<memory at 0x00000209D5F28110>

**File Handling with NumPy**

In [2]:
import numpy as np

In [20]:
#Read the csv file
arr=np.loadtxt("Advertising.csv",dtype=str,delimiter=',')
arr

array([['""', '"TV"', '"Radio"', '"Newspaper"', '"Sales"'],
       ['"1"', '230.1', '37.8', '69.2', '22.1'],
       ['"2"', '44.5', '39.3', '45.1', '10.4'],
       ...,
       ['"198"', '177', '9.3', '6.4', '12.8'],
       ['"199"', '283.6', '42', '66.2', '25.5'],
       ['"200"', '232.1', '8.6', '8.7', '13.4']], dtype='<U11')

In [21]:
arr = np.genfromtxt("Advertising.csv", delimiter=',', skip_header=1, dtype=str)
arr


array([['"1"', '230.1', '37.8', '69.2', '22.1'],
       ['"2"', '44.5', '39.3', '45.1', '10.4'],
       ['"3"', '17.2', '45.9', '69.3', '9.3'],
       ['"4"', '151.5', '41.3', '58.5', '18.5'],
       ['"5"', '180.8', '10.8', '58.4', '12.9'],
       ['"6"', '8.7', '48.9', '75', '7.2'],
       ['"7"', '57.5', '32.8', '23.5', '11.8'],
       ['"8"', '120.2', '19.6', '11.6', '13.2'],
       ['"9"', '8.6', '2.1', '1', '4.8'],
       ['"10"', '199.8', '2.6', '21.2', '10.6'],
       ['"11"', '66.1', '5.8', '24.2', '8.6'],
       ['"12"', '214.7', '24', '4', '17.4'],
       ['"13"', '23.8', '35.1', '65.9', '9.2'],
       ['"14"', '97.5', '7.6', '7.2', '9.7'],
       ['"15"', '204.1', '32.9', '46', '19'],
       ['"16"', '195.4', '47.7', '52.9', '22.4'],
       ['"17"', '67.8', '36.6', '114', '12.5'],
       ['"18"', '281.4', '39.6', '55.8', '24.4'],
       ['"19"', '69.2', '20.5', '18.3', '11.3'],
       ['"20"', '147.3', '23.9', '19.1', '14.6'],
       ['"21"', '218.4', '27.7', '53.4', '18'],

Sample

In [23]:
arr=np.array([[1,2,3],[4,5,6]])
arr

array([[1, 2, 3],
       [4, 5, 6]])

In [24]:
np.savetxt('sample_numpy.csv',arr,delimiter=',')

In [25]:
np.save('file.npy',arr)

In [26]:
arr=np.load('file.npy')

In [27]:
arr

array([[1, 2, 3],
       [4, 5, 6]])
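
By default `np.savetxt` writes every value as a float in scientific notation. The sketch below shows one way to write integers with a header row and read the numbers back; the temporary directory and the column names `a,b,c` are assumptions made for the example.

```python
import os
import tempfile
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sample_numpy.csv")

    # fmt="%d" writes integers; comments="" stops the header being prefixed with "# "
    np.savetxt(csv_path, arr, delimiter=",", fmt="%d", header="a,b,c", comments="")

    # skip_header=1 drops the column names so the remaining fields parse as numbers
    loaded = np.genfromtxt(csv_path, delimiter=",", skip_header=1)

print(loaded)
```

Note that `np.genfromtxt` returns floats by default, so the values match the original array but the dtype does not. The binary `.npy` format used by `np.save` and `np.load` preserves the dtype.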

### Examples using Statsmodels

In [1]:
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [2]:
#Get the Guerry Dataset for statistical analysis
data=sm.datasets.get_rdataset('Guerry','HistData').data

In [3]:
#print the first 5 rows in dataset
data.head()

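| | dept | Region | Department | Crime_pers | Crime_prop | Literacy | Donations | Infants | Suicides | MainCity | ... | Crime_parents | Infanticide | Donation_clergy | Lottery | Desertion | Instruction | Prostitutes | Distance | Area | Pop1831 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | E | Ain | 28870 | 15890 | 37 | 5098 | 33120 | 35039 | 2:Med | ... | 71 | 60 | 69 | 41 | 55 | 46 | 13 | 218.372 | 5762 | 346.03 |
| 1 | 2 | N | Aisne | 26226 | 5521 | 51 | 8901 | 14572 | 12831 | 2:Med | ... | 4 | 82 | 36 | 38 | 82 | 24 | 327 | 65.945 | 7369 | 513.0 |
| 2 | 3 | C | Allier | 26747 | 7925 | 13 | 10973 | 17044 | 114121 | 2:Med | ... | 46 | 42 | 76 | 66 | 16 | 85 | 34 | 161.927 | 7340 | 298.26 |
| 3 | 4 | E | Basses-Alpes | 12935 | 7289 | 46 | 2733 | 23018 | 14238 | 1:Sm | ... | 70 | 12 | 37 | 80 | 32 | 29 | 2 | 351.399 | 6925 | 155.9 |
| 4 | 5 | E | Hautes-Alpes | 17488 | 8174 | 69 | 6962 | 23076 | 16171 | 1:Sm | ... | 22 | 23 | 64 | 79 | 35 | 7 | 1 | 320.28 | 5549 | 129.1 |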


## smf.ols()
- smf.ols fits an Ordinary Least Squares (OLS) regression model. It is part of the statsmodels.formula.api module of the statsmodels library.
- It is used for linear regression analysis using a formula-based approach similar to R.

### How does smf.ols() work?
- OLS regression is a statistical method used to model the relationship between a dependent variable (Y) and one or more independent variables (X).
- The smf.ols function allows you to specify a regression model using a formula string, which makes it more intuitive compared to the matrix-based approach.

### Why use np.log(Pop1831) instead of Pop1831?
- Population (Pop1831) is usually right-skewed (a few areas have very large populations).
- Applying np.log(Pop1831) reduces the skewness and can make the relationship more linear.
- It can help meet the assumptions of OLS regression, such as homoscedasticity (constant variance).
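
The effect of a log transform on skewness can be seen on a small made-up sample. The values below are hypothetical, not taken from the Guerry data, and `skewness` is a helper defined only for this sketch.

```python
import numpy as np

def skewness(x):
    """Population skewness: the mean cubed z-score."""
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 3)

# Hypothetical population-like values spanning several orders of magnitude
pop = np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])

raw_skew = skewness(pop)
log_skew = skewness(np.log(pop))

print(raw_skew)
print(log_skew)
```

The raw values are strongly right-skewed, whereas their logarithms are evenly spaced, so the skewness drops to approximately zero.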

### Why is this regression useful?
- It helps analyze how literacy and population size relate to lottery spending.
- The log transformation can make the relationship more linear and easier to interpret.
- The results can be used to make predictions or identify key factors associated with lottery spending.


In [4]:
#fit an OLS model on the dataset
results=smf.ols('Lottery ~ Literacy + np.log(Pop1831)',data=data).fit()

In [5]:
results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Lottery | R-squared: | 0.348 |
| Model: | OLS | Adj. R-squared: | 0.333 |
| Method: | Least Squares | F-statistic: | 22.2 |
| Date: | Tue, 04 Feb 2025 | Prob (F-statistic): | 1.9e-08 |
| Time: | 15:08:23 | Log-Likelihood: | -379.82 |
| No. Observations: | 86 | AIC: | 765.6 |
| Df Residuals: | 83 | BIC: | 773.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 246.4341 | 35.233 | 6.995 | 0.000 | 176.358 | 316.510 |
| Literacy | -0.4889 | 0.128 | -3.832 | 0.000 | -0.743 | -0.235 |
| np.log(Pop1831) | -31.3114 | 5.977 | -5.239 | 0.000 | -43.199 | -19.424 |

| | | | |
|---|---|---|---|
| Omnibus: | 3.713 | Durbin-Watson: | 2.019 |
| Prob(Omnibus): | 0.156 | Jarque-Bera (JB): | 3.394 |
| Skew: | -0.487 | Prob(JB): | 0.183 |
| Kurtosis: | 3.003 | Cond. No. | 702.0 |


### Advertising data used for OLS regression
- Two models were fitted: the first with Sales dependent on TV only, and the second with Sales dependent on TV, Radio and Newspaper. The latter model fits better: R-squared rises from 0.612 to 0.897, and adjusted R-squared, which penalizes extra predictors, rises from 0.610 to 0.896. In the second model, TV and Radio have a significant effect on Sales (p < 0.001), whereas Newspaper does not (p = 0.860).

In [6]:
import pandas as pd
import statsmodels.formula.api as smf

In [8]:
data=pd.read_csv('Advertising.csv',index_col=0)

In [9]:
data.head()

| | TV | Radio | Newspaper | Sales |
|---|---|---|---|---|
| 1 | 230.1 | 37.8 | 69.2 | 22.1 |
| 2 | 44.5 | 39.3 | 45.1 | 10.4 |
| 3 | 17.2 | 45.9 | 69.3 | 9.3 |
| 4 | 151.5 | 41.3 | 58.5 | 18.5 |
| 5 | 180.8 | 10.8 | 58.4 | 12.9 |


In [10]:
model=smf.ols(formula ='Sales ~ TV',data=data).fit()

In [11]:
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Sales | R-squared: | 0.612 |
| Model: | OLS | Adj. R-squared: | 0.61 |
| Method: | Least Squares | F-statistic: | 312.1 |
| Date: | Tue, 04 Feb 2025 | Prob (F-statistic): | 1.47e-42 |
| Time: | 16:06:31 | Log-Likelihood: | -519.05 |
| No. Observations: | 200 | AIC: | 1042.0 |
| Df Residuals: | 198 | BIC: | 1049.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 7.0326 | 0.458 | 15.360 | 0.000 | 6.130 | 7.935 |
| TV | 0.0475 | 0.003 | 17.668 | 0.000 | 0.042 | 0.053 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.531 | Durbin-Watson: | 1.935 |
| Prob(Omnibus): | 0.767 | Jarque-Bera (JB): | 0.669 |
| Skew: | -0.089 | Prob(JB): | 0.716 |
| Kurtosis: | 2.779 | Cond. No. | 338.0 |


In [12]:
model.params

Intercept    7.032594
TV           0.047537
dtype: float64

In [13]:
model.conf_int()

| | 0 | 1 |
|---|---|---|
| Intercept | 6.129719 | 7.935468 |
| TV | 0.042231 | 0.052843 |


In [14]:
model.pvalues

Intercept    1.406300e-35
TV           1.467390e-42
dtype: float64

In [15]:
model.rsquared

0.611875050850071

In [16]:
model=smf.ols(formula ='Sales ~ TV + Radio + Newspaper',data=data).fit()

In [17]:
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Sales | R-squared: | 0.897 |
| Model: | OLS | Adj. R-squared: | 0.896 |
| Method: | Least Squares | F-statistic: | 570.3 |
| Date: | Tue, 04 Feb 2025 | Prob (F-statistic): | 1.58e-96 |
| Time: | 16:09:32 | Log-Likelihood: | -386.18 |
| No. Observations: | 200 | AIC: | 780.4 |
| Df Residuals: | 196 | BIC: | 793.6 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 2.9389 | 0.312 | 9.422 | 0.000 | 2.324 | 3.554 |
| TV | 0.0458 | 0.001 | 32.809 | 0.000 | 0.043 | 0.049 |
| Radio | 0.1885 | 0.009 | 21.893 | 0.000 | 0.172 | 0.206 |
| Newspaper | -0.0010 | 0.006 | -0.177 | 0.860 | -0.013 | 0.011 |

| | | | |
|---|---|---|---|
| Omnibus: | 60.414 | Durbin-Watson: | 2.084 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 151.241 |
| Skew: | -1.327 | Prob(JB): | 1.44e-33 |
| Kurtosis: | 6.332 | Cond. No. | 454.0 |


In [18]:
model.params

Intercept    2.938889
TV           0.045765
Radio        0.188530
Newspaper   -0.001037
dtype: float64

In [20]:
model.pvalues

Intercept    1.267295e-17
TV           1.509960e-81
Radio        1.505339e-54
Newspaper    8.599151e-01
dtype: float64

In [21]:
model.rsquared

0.8972106381789522
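
Because R-squared never decreases when predictors are added, adjusted R-squared is a fairer basis for comparing the two models. As a worked check, the sketch below recomputes it from the R-squared values and observation count reported above, using the standard formula for a model with an intercept. The helper function name is made up for this example.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared for a model with an intercept."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# R-squared values and observation count as reported by the two fitted models
adj_tv_only = adjusted_r_squared(0.611875050850071, 200, 1)
adj_all_three = adjusted_r_squared(0.8972106381789522, 200, 3)

print(round(adj_tv_only, 3))
print(round(adj_all_three, 3))
```

The results agree with the Adj. R-squared entries in the two summaries (0.61 and 0.896).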

### Examples using SciPy

In [22]:
from scipy import constants

In [23]:
dir(constants)

['Avogadro',
 'Boltzmann',
 'Btu',
 'Btu_IT',
 'Btu_th',
 'G',
 'Julian_year',
 'N_A',
 'Planck',
 'R',
 'Rydberg',
 'Stefan_Boltzmann',
 'Wien',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '_codata',
 '_constants',
 '_obsolete_constants',
 'acre',
 'alpha',
 'angstrom',
 'arcmin',
 'arcminute',
 'arcsec',
 'arcsecond',
 'astronomical_unit',
 'atm',
 'atmosphere',
 'atomic_mass',
 'atto',
 'au',
 'bar',
 'barrel',
 'bbl',
 'blob',
 'c',
 'calorie',
 'calorie_IT',
 'calorie_th',
 'carat',
 'centi',
 'codata',
 'constants',
 'convert_temperature',
 'day',
 'deci',
 'degree',
 'degree_Fahrenheit',
 'deka',
 'dyn',
 'dyne',
 'e',
 'eV',
 'electron_mass',
 'electron_volt',
 'elementary_charge',
 'epsilon_0',
 'erg',
 'exa',
 'exbi',
 'femto',
 'fermi',
 'find',
 'fine_structure',
 'fluid_ounce',
 'fluid_ounce_US',
 'fluid_ounce_imp',
 'foot',
 'g',
 'gallon',
 'gallon_US',
 'gallon_imp',
 'gas_co

In [25]:
constants.Avogadro

6.02214076e+23

In [26]:
constants.year

31536000.0

In [27]:
constants.gram

0.001

In [28]:
constants.degree

0.017453292519943295

In [31]:
#convert 45 degrees to radians without the scipy package, then take the sine
import numpy as np
np.sin(45*np.pi/180)

0.7071067811865476

In [32]:
#convert 45 degrees to radians using scipy's constants.degree, then take the sine
np.sin(45*constants.degree)

0.7071067811865476
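
The constants are plain floats expressed in SI units, so they can be used as conversion factors. The sketch below shows a few more from the same module along with `convert_temperature`; the quantities being converted are made up for the example.

```python
import numpy as np
from scipy import constants

# constants.degree is the number of radians in one degree, i.e. pi / 180
same_factor = np.isclose(constants.degree, np.pi / 180)

# Unit prefixes and time units are plain floats that can be multiplied directly
metres = 5 * constants.kilo      # 5 km in metres
seconds = 2 * constants.hour     # 2 hours in seconds

# convert_temperature converts between named temperature scales
boiling_k = constants.convert_temperature(100, "Celsius", "Kelvin")

print(same_factor, metres, seconds, boiling_k)
```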
