In [1]:
import numpy as np
import pandas as pd

# Numpy

`NumPy` (Numerical Python) is one of the most important foundational packages for numerical computing in Python.
<br/>
One of the reasons NumPy is important for numerical computation in Python is that it is designed for efficiency on large arrays of data:

<ul>
    <li>
        NumPy internally stores data in a contiguous block of memory, independent of other built-in Python objects
    </li>
    <li>
        NumPy arrays use much less memory than Python's built-in sequences
    </li>
    <li>
        NumPy operations perform complex computations on entire arrays without the need for Python `for` loops
    </li>
</ul>
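The memory point above can be checked with a rough sketch. The size `n` and the variable names are illustrative choices, and the list figure is only an estimate (it adds up `sys.getsizeof` for the list and for each integer object), so the exact numbers will vary between Python versions.

```python
import sys
import numpy as np

n = 1_000  # illustrative size, not taken from the text above

arr = np.arange(n, dtype=np.int64)
lst = list(range(n))

# ndarray: one contiguous buffer, 8 bytes per int64 element
array_bytes = arr.nbytes

# list: a buffer of pointers plus a separate Python int object per element
list_bytes = sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst)

print(f"ndarray data: {array_bytes} bytes")
print(f"list (estimate): {list_bytes} bytes")
```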


In [10]:
my_arr = np.arange(1_000_000)
%timeit my_arr2 = my_arr * 2

3.61 ms ± 74.5 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [11]:
my_list = list(range(1_000_000))
%timeit my_list2 = [x * 2 for x in my_list]

85.7 ms ± 2.91 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


The result above shows that multiplying every item of a NumPy array by 2 takes roughly `3.61 ms`, while the same operation on a list takes nearly `85.7 ms`, about 24 times longer


## Numpy array

One of the key features of NumPy is its n-dimensional array object, `ndarray`. Arrays let you perform mathematical operations on whole blocks of data using syntax similar to the equivalent operations between scalar elements. `In other words, you can apply scalar operations to arrays that would otherwise require a loop over a built-in Python list`


In [19]:
example_data = np.array([
    [1, 0.5, 6],
    [5, -3, 5.5]
])
print(f"Data: {example_data}\n")

print(f"Data doubled: {example_data * 2}\n")

print(f"Data added with itself: {example_data + example_data}")

Data: [[ 1.   0.5  6. ]
 [ 5.  -3.   5.5]]

Data doubled: [[ 2.  1. 12.]
 [10. -6. 11.]]

Data added with itself: [[ 2.  1. 12.]
 [10. -6. 11.]]


A NumPy array is a multidimensional container for homogeneous data (all of the elements must be of the same data type)


In [21]:
# example_data = np.array([
#     [1, 0.5, 6],
#     [5, 'hello', 5.5]
# ])

# This does not raise an error. NumPy coerces every element to one common dtype, so mixing in a str turns the whole array into strings, just as the integers above were converted to float
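
A minimal sketch of how NumPy settles on a single dtype when the inputs are mixed. The variable names are illustrative, and the exact string width NumPy picks may differ between versions, so only the dtype kind is inspected here.

```python
import numpy as np

ints_and_floats = np.array([1, 0.5, 6])
print(ints_and_floats.dtype)    # float64: the integers are upcast to float

with_string = np.array([5, "hello", 5.5])
print(with_string.dtype.kind)   # 'U': every element is stored as a Unicode string
print(with_string)
```

Because the numbers end up as strings, arithmetic such as `with_string * 2` would no longer behave numerically, which is usually a sign that the input data needs cleaning first.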

The easiest way to create an array is with `np.array()`


In [32]:
example_data = [6, 0.2, 3.6, 4.2]
example_data_array = np.array(example_data)
print(f'Array: {example_data_array}') # it may look like a plain list, but its type is different

print(f"Example data type: {type(example_data)}")
print(f"Example data array type: {type(example_data_array)}")

print(f"Example array dimensions: {example_data_array.ndim}")
print(f'Example array shape: {example_data_array.shape}')
print(f'Example array size: {example_data_array.size}')
print(f"Example array data type: {example_data_array.dtype}")

Array: [6.  0.2 3.6 4.2]
Example data type: <class 'list'>
Example data array type: <class 'numpy.ndarray'>
Example array dimensions: 1
Example array shape: (4,)
Example array size: 4
Example array data type: float64


You can explicitly cast an array from one type to another using the ndarray `astype()` method. Casting from float to integer drops the fractional part


In [68]:
print(f'Example array: {example_data_array}')

print(f'Example array type: {example_data_array.dtype}\n')

int_example_array = example_data_array.astype(np.int64)

print(f"Example array after type conversion to int: {int_example_array}")

print(f'Example array type: {int_example_array.dtype}\n')

Example array: [6.  0.2 3.6 4.2]
Example array type: float64

Example array after type conversion to int: [6 0 3 4]
Example array type: int64
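
The cast above drops the fractional part instead of rounding. A small sketch with negative values makes the difference visible. The sample values are illustrative, and `np.rint` is shown only as one possible way to round before casting.

```python
import numpy as np

values = np.array([-1.7, -0.2, 0.2, 1.7])

truncated = values.astype(np.int64)         # fractional part dropped (toward zero)
rounded = np.rint(values).astype(np.int64)  # round to nearest first, then cast

print(truncated)
print(rounded)
```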



If an array contains strings that represent numbers, it can be converted to float or integer. If the cast fails (for example, a string that cannot be converted to a number), a `ValueError` is raised


In [84]:
string_numbers_array = np.array(['1.2', '4.4', '0.5', '-1.6'])
print("Strings array: ", string_numbers_array)
print("Strings array type: ", string_numbers_array.dtype)

float_numbers_array = string_numbers_array.astype(np.float64)
print("\nFloat numbers array: ", float_numbers_array)
print("Float numbers array type: ", float_numbers_array.dtype)

Strings array:  ['1.2' '4.4' '0.5' '-1.6']
Strings array type:  <U4

Float numbers array:  [ 1.2  4.4  0.5 -1.6]
Float numbers array type:  float64
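
The failure case mentioned above can be sketched as follows. The array `bad_strings` is a made-up example containing one entry that is not a number.

```python
import numpy as np

bad_strings = np.array(["1.2", "abc", "0.5"])  # "abc" cannot be parsed as a float

try:
    bad_strings.astype(np.float64)
    cast_failed = False
except ValueError as err:
    cast_failed = True
    print(f"Cast failed: {err}")
```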


## Numpy functions


NumPy provides other functions for creating new arrays


In [58]:
print(f"Zero function: {np.zeros(5)}\n") #prints an array of zero

print(f"Zero matrix: {np.zeros((3,3))}\n")

print(f"Empty array: {np.empty((2,3))}\n") #prints an array with garbage values

print(f"Type of numpy.empty: {type(np.empty((2,2)))}\n")

print(f"Identity matrix: {np.eye(3)}\n") #identity matrix

print(f"Random matrix: {np.random.random((2, 2))}\n") #random matrix

print(f"Matrix of one: {np.ones((2,3))}\n")

print(f"Matrix of one while taking an another array as input: {np.ones_like(example_data)}\n") # ones_like takes an existing array (or list) and returns an array of ones with the same shape

print(f"Full numpy matrix: {np.full(3,5)}")

Zero function: [0. 0. 0. 0. 0.]

Zero matrix: [[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]

Empty array: [[1. 1. 1.]
 [1. 1. 1.]]

Type of numpy.empty: <class 'numpy.ndarray'>

Identity matrix: [[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]

Random matrix: [[0.51772468 0.0988314 ]
 [0.00306367 0.47616796]]

Matrix of one: [[1. 1. 1.]
 [1. 1. 1.]]

Matrix of one while taking an another array as input: [1. 1. 1. 1.]

Full numpy matrix: [5 5 5]


## Numpy Arrays Arithmetic

An important feature of a NumPy array is that you can apply arithmetic operations without writing loops. This is called `vectorization` <br/> <br/>
Vectorization means performing an operation on an entire array at once rather than iterating through individual elements with a loop, which leads to faster and more concise code


In [44]:
example_array = np.array([[1., 2., 3.], [4., 5., 6.]])
print(f'Example array {example_array}')

Example array [[1. 2. 3.]
 [4. 5. 6.]]


In [89]:
print(f"Example array double: {example_array * 2}\n")

print(f"Example array square: {example_array * example_array}\n")

print(f"1 divided by example array {1/example_array}\n")

Example array double: [[ 2.  4.  6.]
 [ 8. 10. 12.]]

Example array square: [[ 1.  4.  9.]
 [16. 25. 36.]]

1 divided by example array [[1.         0.5        0.33333333]
 [0.25       0.2        0.16666667]]
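
To make the idea of vectorization concrete, the sketch below writes the doubling operation twice: once with explicit Python loops and once as a single array expression. The names `looped` and `vectorized` are illustrative; both produce the same result.

```python
import numpy as np

example = np.array([[1., 2., 3.], [4., 5., 6.]])

# Explicit loops: visit every element one at a time
looped = np.empty_like(example)
for i in range(example.shape[0]):
    for j in range(example.shape[1]):
        looped[i, j] = example[i, j] * 2

# Vectorized: one expression, the loop runs inside NumPy
vectorized = example * 2

print(np.array_equal(looped, vectorized))
```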



We can compare two arrays of the same shape. The comparison is done element-wise and returns a Boolean array


In [90]:
example_array_2 = np.array([[0., 4., 1.], [7., 2., 12.]])

print(f"First example array: {example_array}\n")
print(f"Second example array: {example_array_2}")

First example array: [[1. 2. 3.]
 [4. 5. 6.]]

Second example array: [[ 0.  4.  1.]
 [ 7.  2. 12.]]


In [91]:
example_array > example_array_2

array([[ True, False,  True],
       [False,  True, False]])

In [93]:
example_array < example_array_2

array([[False,  True, False],
       [ True, False,  True]])
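
A Boolean array produced by a comparison can also be used to select or count elements. The sketch below reuses the same values as the two example arrays under the shorter, illustrative names `a` and `b`.

```python
import numpy as np

a = np.array([[1., 2., 3.], [4., 5., 6.]])
b = np.array([[0., 4., 1.], [7., 2., 12.]])

mask = a > b
selected = a[mask]         # elements of a where a > b, flattened to 1-D
count = int(mask.sum())    # True counts as 1, False as 0

print(selected)
print(count)
print(mask.any(), mask.all())
```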

## Indexing and Slicing


Indexing and slicing of one-dimensional arrays are simple; on the surface they act similarly to Python lists


In [10]:
simple_array = np.arange(10)
print(f"Array: {simple_array}\n")

print(f"First element of array: {simple_array[0]}")
print(f"Last element of array: {simple_array[-1]}")
print(f"Elements 3 to 5 of array: {simple_array[2:5]}\n")

simple_array[5] = 15
print(f"Updating element at index 5: {simple_array}")

simple_array[0:2] = 56
print(f"Updating first 2 elements: {simple_array}")

Array: [0 1 2 3 4 5 6 7 8 9]

First element of array: 0
Last element of array: 9
Elements 3 to 5 of array: [2 3 4]

Updating element at index 5: [ 0  1  2  3  4 15  6  7  8  9]
Updating first 2 elements: [56 56  2  3  4 15  6  7  8  9]
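
One place where arrays stop behaving like lists is that a basic slice of an array is a view onto the same memory, while a list slice is an independent copy. A minimal sketch with illustrative variable names:

```python
import numpy as np

arr = np.arange(10)
lst = list(range(10))

arr_slice = arr[2:5]   # a view onto arr's memory
lst_slice = lst[2:5]   # a new, independent list

arr_slice[0] = 99
lst_slice[0] = 99

print(arr[2])  # changed through the view
print(lst[2])  # unchanged

independent = arr[5:8].copy()  # explicit copy when a view is not wanted
independent[0] = -1
print(arr[5])  # not affected by the change to the copy
```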


In a two-dimensional array, the element at each index is no longer a scalar but a one-dimensional array


In [20]:
two_dimensional_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(f"Two dimensional array: {two_dimensional_array}")

print(f"\nFirst element: {two_dimensional_array[0][0]}")
print(f"Last element: {two_dimensional_array[2][2]}")

print(f"\nSecond row: {two_dimensional_array[1]}")
print(f"Third column: {two_dimensional_array[:, 2]}")
print(f"Second column: {two_dimensional_array[:, 1]}")

Two dimensional array: [[1 2 3]
 [4 5 6]
 [7 8 9]]

First element: 1
Last element: 9

Second row: [4 5 6]
Third column: [3 6 9]
Second column: [2 5 8]


In [42]:
print(f"Second and third row: {two_dimensional_array[1:3]}")
print(f"Second and third column: {two_dimensional_array[:, 1:3]}")

Second and third row: [[4 5 6]
 [7 8 9]]
Second and third column: [[2 3]
 [5 6]
 [8 9]]
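
Row and column selections can be combined in a single pair of brackets by separating the indices with a comma. The sketch below uses an illustrative array named `grid` with the same values as the two-dimensional example.

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# grid[2][2] indexes twice; grid[2, 2] reaches the same element in one step
same_element = grid[2][2] == grid[2, 2]

# Rows 1-2 and columns 1-2 in one indexing operation
block = grid[1:3, 1:3]

print(same_element)
print(block)
```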


Let's look at arrays with more than two dimensions


In [37]:
three_dimensional_array = np.array([[[1, 2, 3], [4, 5, 6]],[[7, 8, 9], [10, 11, 12]]])

print(f"Multi dimensional array: {three_dimensional_array}")
print(f"\nMulti dimensional array dimension: {three_dimensional_array.ndim}")
print(f"Multi dimensional array shape: {three_dimensional_array.shape}")
print(f"Multi dimensional array size: {three_dimensional_array.size}")

print(f"\n First two dim array {three_dimensional_array[0]}") # returns a two dimensional array
print(f"Second one dim array {three_dimensional_array[1][1]}") # returns a one dimensional array
print(f"Third element of second one dim array {three_dimensional_array[1][1][2]}") # returns a scalar value

Multi dimensional array: [[[ 1  2  3]
  [ 4  5  6]]

 [[ 7  8  9]
  [10 11 12]]]

Multi dimensional array dimension: 3
Multi dimensional array shape: (2, 2, 3)
Multi dimensional array size: 12

 First two dim array [[1 2 3]
 [4 5 6]]
Second one dim array [10 11 12]
Third element of second one dim array 12


`Fancy indexing` refers to using arrays or lists of indices to access multiple elements of another array at once. This allows for more flexible and powerful data manipulation


In [50]:
example_array = np.array([10, 20, 30, 40, 50])
print(f'Example array: {example_array}')
indices = [1, 3, 4]

print(f"Selected elements: {example_array[indices]}")

print(f"\nTwo dimensional array: {two_dimensional_array}")

# Selects elements at (0, 2), (1, 1), and (2, 0)
print(f"Selected elements: {two_dimensional_array[[0, 1, 2], [2, 1, 0]]}")

Example array: [10 20 30 40 50]
Selected elements: [20 40 50]

Two dimensional array: [[1 2 3]
 [4 5 6]
 [7 8 9]]
Selected elements: [3 5 7]
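
Unlike basic slicing, fancy indexing returns a copy of the selected elements, although assigning through a fancy index still modifies the array itself. A small sketch with an illustrative array `data`:

```python
import numpy as np

data = np.array([10, 20, 30, 40, 50])

picked = data[[1, 3, 4]]   # fancy indexing copies the selected elements
picked[0] = -1
after_copy_edit = data.tolist()   # the original is unchanged

data[[1, 3]] = 0           # assignment through a fancy index does modify data
after_assignment = data.tolist()

print(after_copy_edit)
print(after_assignment)
```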


`Transposing` is a special form of reshaping that swaps the rows and columns of an array. `The rows become columns and the columns become rows`


In [53]:
example_array = np.arange(15).reshape((3, 5))
print(f"Example array:\n {example_array}")

print(f"\nExample array transpose: {example_array.T}")

Example array:
 [[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]

Example array transpose: [[ 0  5 10]
 [ 1  6 11]
 [ 2  7 12]
 [ 3  8 13]
 [ 4  9 14]]
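
A short sketch of what transposing does to the shape and to the underlying data. It uses `np.shares_memory` to show that `.T` gives a view and does not copy the data; the variable names are illustrative.

```python
import numpy as np

arr = np.arange(15).reshape((3, 5))
transposed = arr.T

print(arr.shape, transposed.shape)
print(np.shares_memory(arr, transposed))   # .T is a view, no data is copied
print(transposed[4, 2] == arr[2, 4])       # element (i, j) moves to (j, i)
```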


We can also compute the matrix (dot) product with `np.dot()`


In [54]:
print(f"Dot product of example array and transpose of example array: {np.dot(example_array, example_array.T)}")

Dot product of example array and transpose of example array: [[ 30  80 130]
 [ 80 255 430]
 [130 430 730]]
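
For two-dimensional arrays, the `@` operator gives the same matrix product as `np.dot`. A minimal sketch, rebuilding the example array under the illustrative name `arr`:

```python
import numpy as np

arr = np.arange(15).reshape((3, 5))

with_dot = np.dot(arr, arr.T)
with_operator = arr @ arr.T

print(np.array_equal(with_dot, with_operator))
print(with_operator.shape)  # a (3, 5) matrix times a (5, 3) matrix gives (3, 3)
```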


`numpy.random` is used to generate arrays of random data


In [56]:
samples = np.random.standard_normal(size=(4, 4))
print(f"Sample data: \n{samples}")

Sample data: 
[[-0.12661489 -1.05853698  1.06400536  0.5235279 ]
 [-0.16385502  0.45135961 -0.05636292  0.21638825]
 [ 1.79322191  0.76463311  0.31075277 -0.44346989]
 [ 1.37945361  0.0157017  -1.50283368 -0.09045355]]


In [59]:
random_number_generator = np.random.default_rng(seed = 12345) # seed argument determines the initial state of the generator

samples = random_number_generator.standard_normal((2, 3))

print(f"Sample data: \n{samples}")

Sample data: 
[[-1.42382504  1.26372846 -0.87066174]
 [-0.25917323 -0.07534331 -0.74088465]]
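
The practical effect of the seed is reproducibility: two generators created with the same seed produce the same sequence. A small sketch, where `rng_a` and `rng_b` are illustrative names:

```python
import numpy as np

rng_a = np.random.default_rng(seed=12345)
rng_b = np.random.default_rng(seed=12345)

samples_a = rng_a.standard_normal((2, 3))
samples_b = rng_b.standard_normal((2, 3))

print(np.array_equal(samples_a, samples_b))
```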


NumPy has `universal functions` (ufuncs) that perform element-wise operations on the data in arrays


In [68]:
print(f"Samples: \n{samples}")
print(f"\nSquare of array elements: \n{np.square(samples)}")



Samples: 
[[-1.42382504  1.26372846 -0.87066174]
 [-0.25917323 -0.07534331 -0.74088465]]

Square of array elements: 
[[2.02727773 1.59700962 0.75805186]
 [0.06717077 0.00567661 0.54891007]]
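
Universal functions come in unary forms (one input array) and binary forms (two input arrays). The sketch below shows a few of each on small, illustrative arrays `x` and `y`.

```python
import numpy as np

x = np.array([1.0, 4.0, 9.0])
y = np.array([2.0, 3.0, 10.0])

roots = np.sqrt(x)          # unary ufunc
larger = np.maximum(x, y)   # binary ufunc: element-wise maximum
total = np.add(x, y)        # binary ufunc: same as x + y

print(roots)
print(larger)
print(total)
```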


`numpy.meshgrid()` takes two one-dimensional arrays and produces two two-dimensional matrices corresponding to all pairs of (x, y) in the two arrays


In [70]:
points = np.arange(-5, 5, 0.01)
print(f"Points: {points}")

Points: [-5.0000000e+00 -4.9900000e+00 -4.9800000e+00 -4.9700000e+00
 -4.9600000e+00 -4.9500000e+00 -4.9400000e+00 -4.9300000e+00
 -4.9200000e+00 -4.9100000e+00 -4.9000000e+00 -4.8900000e+00
 -4.8800000e+00 -4.8700000e+00 -4.8600000e+00 -4.8500000e+00
 -4.8400000e+00 -4.8300000e+00 -4.8200000e+00 -4.8100000e+00
 -4.8000000e+00 -4.7900000e+00 -4.7800000e+00 -4.7700000e+00
 -4.7600000e+00 -4.7500000e+00 -4.7400000e+00 -4.7300000e+00
 -4.7200000e+00 -4.7100000e+00 -4.7000000e+00 -4.6900000e+00
 -4.6800000e+00 -4.6700000e+00 -4.6600000e+00 -4.6500000e+00
 -4.6400000e+00 -4.6300000e+00 -4.6200000e+00 -4.6100000e+00
 -4.6000000e+00 -4.5900000e+00 -4.5800000e+00 -4.5700000e+00
 -4.5600000e+00 -4.5500000e+00 -4.5400000e+00 -4.5300000e+00
 -4.5200000e+00 -4.5100000e+00 -4.5000000e+00 -4.4900000e+00
 -4.4800000e+00 -4.4700000e+00 -4.4600000e+00 -4.4500000e+00
 -4.4400000e+00 -4.4300000e+00 -4.4200000e+00 -4.4100000e+00
 -4.4000000e+00 -4.3900000e+00 -4.3800000e+00 -4.3700000e+00
 -4.3600000e+00 

In [74]:
xs, ys = np.meshgrid(points, points)
print(f"X points: \n{xs}")
print(f"\nY points: \n{ys}")

X points: 
[[-5.   -4.99 -4.98 ...  4.97  4.98  4.99]
 [-5.   -4.99 -4.98 ...  4.97  4.98  4.99]
 [-5.   -4.99 -4.98 ...  4.97  4.98  4.99]
 ...
 [-5.   -4.99 -4.98 ...  4.97  4.98  4.99]
 [-5.   -4.99 -4.98 ...  4.97  4.98  4.99]
 [-5.   -4.99 -4.98 ...  4.97  4.98  4.99]]

Y points: 
[[-5.   -5.   -5.   ... -5.   -5.   -5.  ]
 [-4.99 -4.99 -4.99 ... -4.99 -4.99 -4.99]
 [-4.98 -4.98 -4.98 ... -4.98 -4.98 -4.98]
 ...
 [ 4.97  4.97  4.97 ...  4.97  4.97  4.97]
 [ 4.98  4.98  4.98 ...  4.98  4.98  4.98]
 [ 4.99  4.99  4.99 ...  4.99  4.99  4.99]]
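
A typical use of the two grids is to evaluate a function of x and y at every grid point without loops. The sketch below uses a deliberately tiny grid, and the distance-from-origin function is only an illustrative choice.

```python
import numpy as np

points = np.array([-1.0, 0.0, 1.0])   # small grid so the result is easy to read
xs, ys = np.meshgrid(points, points)

# Distance of every (x, y) pair from the origin, computed for the whole grid at once
distance = np.sqrt(xs ** 2 + ys ** 2)

print(distance)
```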


The `numpy.where()` function is a vectorized version of the ternary expression (x if condition else y)


In [81]:
xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])

result = [(x if c else y) for x, y, c in zip(xarr, yarr, cond)]
print(f"Result of traditional ternary operation \n{result}")

result = np.where(cond, xarr, yarr)
print(f"\nResult of np.where {result}")

Result of traditional ternary operation 
[np.float64(1.1), np.float64(2.2), np.float64(1.3), np.float64(1.4), np.float64(2.5)]

Result of np.where [1.1 2.2 1.3 1.4 2.5]
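
The second and third arguments of `np.where` can also be scalars, or a mix of a scalar and an array. A short sketch on an illustrative array `values`:

```python
import numpy as np

values = np.array([[1.5, -0.3], [-2.0, 0.7]])

signs = np.where(values > 0, 1, -1)          # scalar for both branches
clipped = np.where(values < 0, 0.0, values)  # replace negatives, keep the rest

print(signs)
print(clipped)
```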


We can sort an array with `np.sort()`, which returns a sorted copy, or with the array's `.sort()` method, which sorts in place


In [90]:
arr2 = np.array([5, -10, 7, 1, 0, -3])
print(f"Samples: \n{arr2}")

sorted_arr2 = np.sort(arr2)

print(f"Sorted array: \n{sorted_arr2}")

Samples: 
[  5 -10   7   1   0  -3]
Sorted array: 
[-10  -3   0   1   5   7]


In [95]:
samples = np.random.standard_normal((5, 3))
print(f"Sample: \n{samples}")

samples.sort(axis=0)
print(f"\nSample sorted by column \n{samples}")

samples.sort(axis=1)
print(f"\nSample sorted by row \n{samples}")

Sample: 
[[-0.91318408 -0.63387045 -0.11497933]
 [ 0.52476072  0.76283355 -0.9783067 ]
 [ 0.05602162 -0.02295575 -0.04968797]
 [-0.33473125 -0.04034099  1.11457037]
 [-0.20547436  1.60427167  1.21349609]]

Sample sorted by column 
[[-0.91318408 -0.63387045 -0.9783067 ]
 [-0.33473125 -0.04034099 -0.11497933]
 [-0.20547436 -0.02295575 -0.04968797]
 [ 0.05602162  0.76283355  1.11457037]
 [ 0.52476072  1.60427167  1.21349609]]

Sample sorted by row 
[[-0.9783067  -0.91318408 -0.63387045]
 [-0.33473125 -0.11497933 -0.04034099]
 [-0.20547436 -0.04968797 -0.02295575]
 [ 0.05602162  0.76283355  1.11457037]
 [ 0.52476072  1.21349609  1.60427167]]
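
The difference between the function and the method can be sketched as follows. Reversing the sorted copy with `[::-1]` is shown as one simple way to get descending order; the variable names are illustrative.

```python
import numpy as np

arr = np.array([5, -10, 7, 1, 0, -3])

sorted_copy = np.sort(arr)             # new array, arr is left untouched
order_after_np_sort = arr.tolist()

descending = np.sort(arr)[::-1]        # reverse the ascending result

arr.sort()                             # sorts arr itself and returns None
order_after_method = arr.tolist()

print(order_after_np_sort)
print(descending)
print(order_after_method)
```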


We can find unique values in an array using `np.unique()`


In [96]:
print(f"Unique values in samples {np.unique(samples)}")

Unique values in samples [-0.9783067  -0.91318408 -0.63387045 -0.33473125 -0.20547436 -0.11497933
 -0.04968797 -0.04034099 -0.02295575  0.05602162  0.52476072  0.76283355
  1.11457037  1.21349609  1.60427167]
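
`np.unique()` is easier to see on data that actually contains repeats. The sketch below uses a made-up array of names and also asks for the number of occurrences with `return_counts=True`.

```python
import numpy as np

names = np.array(["Bob", "Will", "Joe", "Bob", "Will", "Joe", "Joe"])

# Sorted unique values plus how many times each one appears
unique_names, counts = np.unique(names, return_counts=True)

print(unique_names)
print(counts)
```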


# Pandas


`pandas` is a powerful and widely used open-source library for data manipulation and analysis. It is built on top of NumPy and provides the data structures and functions needed to work with structured data seamlessly


It is commonly used for

<ul>
    <li>
        Data cleaning and pre-processing like handling missing values, transforming data types, and normalizing data
    </li>
    <li>
        Exploratory data analysis that includes summarizing data, calculating statistics, and visualizing data
    </li>
    <li>
        Data wrangling that includes reshaping, merging, and filtering large datasets
    </li>
    <li>
        Data transformation that includes applying functions and transformations to data
    </li>
    <li>
        Time series analysis that includes working with date-time data for analysis and forecasting
    </li>
</ul>


## Series


A `Series` is a one-dimensional array-like object containing a sequence of values of the same type and an associated array of data labels called the index


In [3]:
companies = ['Apple', 'Samsung', 'Alphabet', 'Foxconn', 'Microsoft', 'Huawei', 'Dell Technlogoies', 'Meta', 'Sony']

revenue = [274_515, 200_734, 182_257, 181_945, 143_015, 129_184, 92_224, 85_965, 84_893]

annual_report = pd.Series(revenue, index=companies, name='Annual Revenue')

display(annual_report)

Apple                274515
Samsung              200734
Alphabet             182257
Foxconn              181945
Microsoft            143015
Huawei               129184
Dell Technlogoies     92224
Meta                  85965
Sony                  84893
Name: Annual Revenue, dtype: int64

The main components of a Series are:

<ul>
    <li>
        Data:
        <ul>
            <li>
                This is the most important component of the Series: the data we want to represent
            </li>
        </ul>
    </li>
    <li>
        Index:
        <ul>
            <li>
                The index holds the "labels" of the data we are storing. It is not a required parameter, as pandas assigns default sequential numbers if no index is provided
            </li>
        </ul>
    </li>
    <li>
        Name:
        <ul>
            <li>
                This acts as a form of documentation by giving the Series a name. It is also optional
            </li>
        </ul>
    </li>
</ul>
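
The optional components can be seen in a short sketch. The first Series is built without an index, so pandas numbers the rows itself; the second is built from a dictionary, whose keys become the index. The variable names are illustrative.

```python
import pandas as pd

# No index given: pandas assigns 0, 1, 2, ...
no_index = pd.Series([274_515, 200_734, 182_257])
print(no_index.index)

# Dictionary keys become the index labels; the name is optional
from_dict = pd.Series({"Apple": 274_515, "Samsung": 200_734}, name="Annual Revenue")
print(from_dict)
```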


Series are `strongly typed`; every Series has a single, enforced dtype. A Series of dtype int64 contains only int64 values


We can get data from a Series by `index` label, by `position`, or by several of either at once
<br/> <br/>
We use the Series index to reference and locate the data associated with it. `Series.loc['index']` is the preferred way to reference values


In [120]:
print(f"Annual revenue of Apple: ${annual_report['Apple']}") # get the value associated with index 'Apple'
print(f"Annual Revenue of Meta: ${annual_report.get('Meta')}") # another way to get data associated with an index

# print(f"Annual revenue of Nvidia: ${annual_report['Nvidia']}") # This raises a KeyError because there is no index label 'Nvidia'
# An alternative is .get(), which returns None (or a default you supply) instead of raising

print(f"Annual revenue of Nvidia: ${annual_report.get('Nvidia')}")

print(f"Annual revenue of Samsung: ${annual_report.loc['Samsung']}") # .loc is another way to get the value associated with a label

print(f"Annual revenue of Alphabet and Dell: \n{annual_report.loc[['Alphabet', 'Dell Technlogoies']]}")

Annual revenue of Apple: $274515
Annual Revenue of Meta: $85965
Annual revenue of Nvidia: $None
Annual revenue of Samsung: $200734
Annual revenue of Alphabet and Dell: 
Alphabet             182257
Dell Technlogoies     92224
Name: Annual Revenue, dtype: int64
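
A sketch of the usual ways to deal with a label that may be missing: test membership with `in`, pass a default to `.get()`, or catch the `KeyError` from `.loc`. The two-row Series `report` is a cut-down, illustrative stand-in for the annual report.

```python
import pandas as pd

report = pd.Series({"Apple": 274_515, "Meta": 85_965}, name="Annual Revenue")

has_nvidia = "Nvidia" in report            # membership test on the index labels
nvidia_revenue = report.get("Nvidia", 0)   # default value instead of None

try:
    report.loc["Nvidia"]
    raised = False
except KeyError:
    raised = True

print(has_nvidia, nvidia_revenue, raised)
```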


We can also select elements by their position. The `.iloc[]` indexer is used for this


In [123]:
print(f"First data in the series: {annual_report.iloc[0]}")
print(f"Last data in the series: {annual_report.iloc[-1]}")
print(f"Data at first, 5 and last index: \n{annual_report.iloc[[0, 5, -1]]}")

First data in the series: 274515
Last data in the series: 84893
Data at first, 5 and last index: 
Apple     274515
Huawei    129184
Sony       84893
Name: Annual Revenue, dtype: int64
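
Both indexers also accept slices, with one difference worth knowing: a `.loc` slice includes its end label, while an `.iloc` slice excludes its end position, as Python slices do. A minimal sketch on an illustrative four-row Series:

```python
import pandas as pd

report = pd.Series(
    [274_515, 200_734, 182_257, 181_945],
    index=["Apple", "Samsung", "Alphabet", "Foxconn"],
)

by_label = report.loc["Apple":"Alphabet"]   # end label included
by_position = report.iloc[0:2]              # end position excluded

print(by_label)
print(by_position)
```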


A series can be converted back to a dictionary using `to_dict()` method


In [124]:
print(f"Dictionary version of the series: {annual_report.to_dict()}")

Dictionary version of the series: {'Apple': 274515, 'Samsung': 200734, 'Alphabet': 182257, 'Foxconn': 181945, 'Microsoft': 143015, 'Huawei': 129184, 'Dell Technlogoies': 92224, 'Meta': 85965, 'Sony': 84893}


The `isna()` function is used to detect missing, NA or null values


In [125]:
print(f"Missing values in series: {pd.isna(annual_report)}")

Missing values in series: Apple                False
Samsung              False
Alphabet             False
Foxconn              False
Microsoft            False
Huawei               False
Dell Technlogoies    False
Meta                 False
Sony                 False
Name: Annual Revenue, dtype: bool


To check that values are not null, NA, or missing, you can use `notna()`


In [126]:
print(f"Non-null values in the series: {pd.notna(annual_report)}")

Non-null values in the series: Apple                True
Samsung              True
Alphabet             True
Foxconn              True
Microsoft            True
Huawei               True
Dell Technlogoies    True
Meta                 True
Sony                 True
Name: Annual Revenue, dtype: bool
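
The annual report has no gaps, so both checks return the same value for every row. The sketch below uses a made-up Series with two missing entries to show the checks doing some work, along with `dropna()` and `fillna()` as two common follow-up steps.

```python
import numpy as np
import pandas as pd

report = pd.Series(
    [274_515, np.nan, 182_257, np.nan],
    index=["Apple", "Samsung", "Alphabet", "Foxconn"],
)

missing_count = int(report.isna().sum())   # True counts as 1
without_missing = report.dropna()          # drop rows with missing values
filled = report.fillna(0)                  # or replace them with a default

print(missing_count)
print(without_missing)
print(filled)
```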


Series have many useful attributes and methods. Two of the most common methods are `.head()` [shows the first 5 rows of the series] and `.tail()` [shows the last 5 rows of the series]


In [128]:
print("Top five rows of the annual report")
annual_report.head()

Top five rows of the annual report


Apple        274515
Samsung      200734
Alphabet     182257
Foxconn      181945
Microsoft    143015
Name: Annual Revenue, dtype: int64

In [129]:
print("Bottom five rows of the annual report")
annual_report.tail()

Bottom five rows of the annual report


Microsoft            143015
Huawei               129184
Dell Technlogoies     92224
Meta                  85965
Sony                  84893
Name: Annual Revenue, dtype: int64

Once a series is constructed, we can access each of its attributes separately. The main ones are:

<ul>
    <li>
        .values (data of the series)
    </li>
    <li>
        .index (index of the series)
    </li>
    <li>
        .name (name of the series)
    </li>
    <li>
        .dtype (type assigned)
    </li>
    <li>
        .size (number of elements)
    </li>
</ul>


In [134]:
print(f"Values of the annual report series: \n{annual_report.values}")

print(f"\nIndexes of the annual report series: \n{annual_report.index}")

print(f"\nName of the series: {annual_report.name}")

print(f"Type of the values of the series: {annual_report.dtype}")
print(f"Number of elements in the series: {annual_report.size}")

Values of the annual report series: 
[274515 200734 182257 181945 143015 129184  92224  85965  84893]

Indexes of the annual report series: 
Index(['Apple', 'Samsung', 'Alphabet', 'Foxconn', 'Microsoft', 'Huawei',
       'Dell Technlogoies', 'Meta', 'Sony'],
      dtype='object')

Name of the series: Annual Revenue
Type of the values of the series: int64
Number of elements in the series: 9


The `.describe()` method gives a quick statistical summary of the series


In [135]:
print(f"Quick summary of the annual report series")
annual_report.describe()

Quick summary of the annual report series


count         9.000000
mean     152748.000000
std       63472.949406
min       84893.000000
25%       92224.000000
50%      143015.000000
75%      182257.000000
max      274515.000000
Name: Annual Revenue, dtype: float64

There are also other statistical methods


In [139]:
print(f"Median of the annual report series: {annual_report.median()}")

print(f"\nStandard deviation of the annual report series: {annual_report.std()}")

print(f"\nVariance of the annual report series: {annual_report.var()}")

print(f"\nMaximum value of the annual report series:{annual_report.max()}")

Median of the annual report series: 143015.0

Standard deviation of the annual report series: 63472.94940563263

Variance of the annual report series: 4028815306.25

Maximum value of the annual report series:274515


Sorting a series is extremely simple. You can sort a series by values (`.sort_values()`) or by index (`.sort_index()`). By default, it is sorted in ascending order.


In [147]:
print(f"Annual report series: \n{annual_report}")

print(f"\nSeries sorted by values: \n{annual_report.sort_values().head()}")

print(f"\nSeries sorted by index in descending order: \n{annual_report.sort_index(ascending=False).head()}")

Annual report series: 
Apple                274515
Samsung              200734
Alphabet             182257
Foxconn              181945
Microsoft            143015
Huawei               129184
Dell Technlogoies     92224
Meta                  85965
Sony                  84893
Name: Annual Revenue, dtype: int64

Series sorted by values: 
Sony                  84893
Meta                  85965
Dell Technlogoies     92224
Huawei               129184
Microsoft            143015
Name: Annual Revenue, dtype: int64

Series sorted by index in descending order: 
Sony          84893
Samsung      200734
Microsoft    143015
Meta          85965
Huawei       129184
Name: Annual Revenue, dtype: int64


An important concept in data science and pandas is immutability. `In data science, we don't want to mutate or change things, as it is harder to keep track of these changes`
<br/><br/>
By default, when you sort a series or perform other transforming operations, you don't `actually` sort the series itself. A `new` series is returned, and the underlying series is not changed

<br/> <br/>
If you do want to mutate your series, you must pass the `inplace=True` argument


In [165]:
print(f"Sorted series: \n{annual_report.sort_values()}")

print(f"\n\nOriginal series after sorting: \n{annual_report}" ) # the original series is not mutated

print('\n-----------------------------------------\n')

annual_report.sort_values(inplace=True)

print(f"Original series after sorting: \n{annual_report}" ) # the original series is also mutated

Sorted series: 
Sony                  84893
Meta                  85965
Dell Technlogoies     92224
Huawei               129184
Microsoft            143015
Foxconn              181945
Alphabet             182257
Samsung              200734
Apple                274515
Name: Annual Revenue, dtype: int64


Original series after sorting: 
Apple                274515
Samsung              200734
Alphabet             182257
Foxconn              181945
Microsoft            143015
Huawei               129184
Dell Technlogoies     92224
Meta                  85965
Sony                  84893
Name: Annual Revenue, dtype: int64

-----------------------------------------

Original series after sorting: 
Sony                  84893
Meta                  85965
Dell Technlogoies     92224
Huawei               129184
Microsoft            143015
Foxconn              181945
Alphabet             182257
Samsung              200734
Apple                274515
Name: Annual Revenue, dtype: int64
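
A common alternative to `inplace=True` is to keep the original and bind the returned series to a new name. The sketch below uses an illustrative three-row Series and checks that the original order survives.

```python
import pandas as pd

report = pd.Series([274_515, 84_893, 143_015], index=["Apple", "Sony", "Microsoft"])

by_revenue = report.sort_values()   # a new Series; report keeps its order

original_order = list(report.index)
sorted_order = list(by_revenue.index)

print(original_order)
print(sorted_order)
```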


Modifying a series in place is rarely what we want and is generally not recommended. However, it is still possible to change values and to add or remove elements


In [8]:
print("Annual report")
display(annual_report)

print("Changing value of Microsoft\n")

annual_report['Microsoft'] = 320_120
print("Annual report after modification")
display(annual_report)

annual_report['Nvidia'] = 292_564
del annual_report['Sony']

print("Annual report after deleting Sony data and adding Nvidia data")
display(annual_report)

Annual report


Apple                274515
Samsung              200734
Alphabet             182257
Foxconn              181945
Microsoft            143015
Huawei               129184
Dell Technlogoies     92224
Meta                  85965
Sony                  84893
Name: Annual Revenue, dtype: int64

Changing value of Microsoft

Annual report after modification


Apple                274515
Samsung              200734
Alphabet             182257
Foxconn              181945
Microsoft            320120
Huawei               129184
Dell Technlogoies     92224
Meta                  85965
Sony                  84893
Name: Annual Revenue, dtype: int64

Annual report after deleting Sony data and adding Nvidia data


Apple                274515
Samsung              200734
Alphabet             182257
Foxconn              181945
Microsoft            320120
Huawei               129184
Dell Technlogoies     92224
Meta                  85965
Nvidia               292564
Name: Annual Revenue, dtype: int64

We can concatenate two series using the `pd.concat()` function
<br/> <br/>
`pd.concat([DF1/Series1, DF2/Series2])`


In [11]:
new_data = pd.Series([150_635, 265_530], index= ['Tesla', 'Google'])

new_annual_report = pd.concat([annual_report, new_data])
display(new_annual_report)

Apple                274515
Samsung              200734
Alphabet             182257
Foxconn              181945
Microsoft            320120
Huawei               129184
Dell Technlogoies     92224
Meta                  85965
Nvidia               292564
Tesla                150635
Google               265530
dtype: int64
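
`pd.concat()` keeps the original index labels, including any duplicates. Passing `ignore_index=True` discards them and renumbers the result. A small sketch with two illustrative series that share the label `"b"`:

```python
import pandas as pd

first = pd.Series([1, 2], index=["a", "b"])
second = pd.Series([3, 4], index=["b", "c"])

combined = pd.concat([first, second])                        # labels kept, duplicates allowed
renumbered = pd.concat([first, second], ignore_index=True)   # labels replaced by 0, 1, 2, ...

print(list(combined.index))
print(list(renumbered.index))
```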

### Filtering and Conditional in Series


`Conditional selection` is like filtering or querying. It allows us to answer the following types of questions

<ul>
    <li>
        What companies made more than $x
    </li>
    <li>
        What companies made less than $x
    </li>
    <li>
        What companies made between $x and $y
    </li>
</ul>


A `Boolean array` is a way of selecting in which we pass one Boolean value per element of the series to indicate which elements we want to select and which ones we want to skip: True to keep an element, False to skip it


In [23]:
display(annual_report)
print(f"\nSelecting certain companies: \n")
display(annual_report.loc[[
    True,      # Apple
    False,     # Samsung
    True,      # Alphabet
    False,     # Foxconn
    True,      # Microsoft
    False,     # Huawei
    True,      # Dell
    True,      # Meta
    False,     # Sony
]])

Apple                274515
Samsung              200734
Alphabet             182257
Foxconn              181945
Microsoft            143015
Huawei               129184
Dell Technlogoies     92224
Meta                  85965
Sony                  84893
Name: Annual Revenue, dtype: int64


Selecting certain companies: 



Apple                274515
Alphabet             182257
Microsoft            143015
Dell Technlogoies     92224
Meta                  85965
Name: Annual Revenue, dtype: int64

Only those values are selected for which we have True value


Moreover, series accept comparison operators like `>` and `<`, which return a Boolean series that can be used for selection


In [26]:
print(f"Companies whose revenue exceeds $200 billion \n{annual_report.loc[annual_report>200_000]}")

Companies whose revenue exceeds $200 billion 
Apple      274515
Samsung    200734
Name: Annual Revenue, dtype: int64


In [27]:
print(f"Companies with revenue less than $90 billion \n{annual_report.loc[annual_report < 90_000]}")

Companies with revenue less than $90 billion 
Meta    85965
Sony    84893
Name: Annual Revenue, dtype: int64


In [29]:
print(f"Companies with revenue above average: \n{annual_report.loc[annual_report > annual_report.mean()]}")

Companies with revenue above average: 
Apple       274515
Samsung     200734
Alphabet    182257
Foxconn     181945
Name: Annual Revenue, dtype: int64


In [42]:
print("Companies with revenue greater than $150 billion or less than $90 billion")
annual_report.loc[(annual_report < 90_000) | (annual_report > 150_000)]

Companies with revenue greater than $150 billion or less than $90 billion


Apple       274515
Samsung     200734
Alphabet    182257
Foxconn     181945
Meta         85965
Sony         84893
Name: Annual Revenue, dtype: int64

In [43]:
print(f"Companies with revenue not less than $150 billion: {annual_report.loc[~(annual_report<150_000)]}")

Companies with revenue not less than $150 billion: Apple       274515
Samsung     200734
Alphabet    182257
Foxconn     181945
Name: Annual Revenue, dtype: int64


In [46]:
print(f"Companies with most and least revenue:\n{annual_report.loc[(annual_report == annual_report.max()) | (annual_report == annual_report.min())]}")

Companies with most and least revenue:
Apple    274515
Sony      84893
Name: Annual Revenue, dtype: int64
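
The "between $x and $y" question can be answered either by combining two comparisons with `&` (each wrapped in parentheses) or with the `Series.between()` method. The bounds `low` and `high` and the five-row Series are hypothetical values chosen for illustration.

```python
import pandas as pd

report = pd.Series(
    [274_515, 200_734, 143_015, 129_184, 84_893],
    index=["Apple", "Samsung", "Microsoft", "Huawei", "Sony"],
)

low, high = 100_000, 210_000   # hypothetical bounds, in millions

with_and = report.loc[(report >= low) & (report <= high)]
with_between = report.loc[report.between(low, high)]   # both ends included by default

print(with_between)
```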


In [49]:
print("Count of each value")
annual_report.value_counts()

Count of each value


Annual Revenue
274515    1
200734    1
182257    1
181945    1
143015    1
129184    1
92224     1
85965     1
84893     1
Name: count, dtype: int64

We can perform arithmetic operations on a series without loops (vectorization, just as in NumPy)


In [52]:
revenue_in_billions = annual_report / 1000 # the scalar is applied to every element, as in NumPy; this works because pandas is built on top of NumPy

print(f"Revenue of companies in billion")
display(revenue_in_billions)

Revenue of companies in billion


Apple                274.515
Samsung              200.734
Alphabet             182.257
Foxconn              181.945
Microsoft            143.015
Huawei               129.184
Dell Technlogoies     92.224
Meta                  85.965
Sony                  84.893
Name: Annual Revenue, dtype: float64

In [53]:
print("Subtracting 10 billion in taxes")
revenue_in_billions - 10

Subtracting 10 billion in taxes


Apple                264.515
Samsung              190.734
Alphabet             172.257
Foxconn              171.945
Microsoft            133.015
Huawei               119.184
Dell Technlogoies     82.224
Meta                  75.965
Sony                  74.893
Name: Annual Revenue, dtype: float64

In [54]:
print(f"Annual revenue in dollars {annual_report * 1_000_000}")

Annual revenue in dollars Apple                274515000000
Samsung              200734000000
Alphabet             182257000000
Foxconn              181945000000
Microsoft            143015000000
Huawei               129184000000
Dell Technlogoies     92224000000
Meta                  85965000000
Sony                  84893000000
Name: Annual Revenue, dtype: int64


In [56]:
recession_impact = pd.Series([
    0.91, 0.93, 0.98, 0.97, 0.99, 0.89, 0.87,
    0.82, 0.93], index=companies)

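We can add, subtract, divide or multiply two series. pandas aligns the values by index label, so both series should share the same labels; a label present in only one of them produces `NaN`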


In [59]:
print(f"Revenue after recession hit {annual_report * recession_impact}\n")

Revenue after recession hit Apple                249808.65
Samsung              186682.62
Alphabet             178611.86
Foxconn              176486.65
Microsoft            141584.85
Huawei               114973.76
Dell Technlogoies     80234.88
Meta                  70491.30
Sony                  78950.49
dtype: float64
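
Arithmetic between two series matches values by index label, not by position. The sketch below uses two small, made-up series whose labels are in a different order and only partly overlap. `Series.mul()` with `fill_value` is shown as one way to avoid `NaN` for labels missing on one side.

```python
import pandas as pd

revenue = pd.Series([100.0, 200.0, 300.0], index=["Apple", "Meta", "Sony"])
impact = pd.Series([0.9, 0.5], index=["Sony", "Apple"])   # different order, no "Meta"

aligned = revenue * impact                     # matched by label; "Meta" becomes NaN
filled = revenue.mul(impact, fill_value=1.0)   # treat a missing factor as 1.0

print(aligned)
print(filled)
```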

