
# Python Libraries for Data Analysis: `numpy` and `pandas`

`numpy` and `pandas` are two of the most frequently used libraries for data analysis in Python.

- `numpy`: This library provides fast numerical operations, including linear algebra, Fourier transforms, and random number generation. It's fundamental for scientific computing with Python.

- `pandas`: This library provides data structures and data analysis tools. It is among the most widely used tools for data cleaning and preparation.

Before using these libraries, we need to import them. The convention is to import `numpy` as `np` and `pandas` as `pd` as shown below.


In [None]:
import numpy as np
import pandas as pd


## Numpy

NumPy, which stands for 'Numerical Python', is a library built around a high-performance multidimensional array object and a collection of routines for processing those arrays. It is the core library for scientific computing in Python.

### Creating Arrays

There are several ways to create arrays in Numpy. Here are a few examples:

- `np.array()`: This function is used when we want to create an array from a list or tuple.

- `np.zeros()`: This function is used when we want to create an array filled with zeroes.

- `np.ones()`: This function is used when we want to create an array filled with ones.

- `np.empty()`: This function is used when we want to create an array without initializing entries. Its contents are arbitrary until we assign values.

- `np.arange()`: This function returns evenly spaced values within a given interval, using a step size. The stop value is excluded.

- `np.linspace()`: This function is similar to `np.arange()`, but instead of a step size we specify how many evenly spaced values to generate. By default the stop value is included.


In [None]:
# Using np.array()
a = np.array([1, 2, 3])
print('a:', a)

# Using np.zeros()
b = np.zeros((3,3))
print('b:', b)

# Using np.ones()
c = np.ones((4,4))
print('c:', c)

# Using np.empty()
d = np.empty((2,3))
print('d:', d)

# Using np.arange()
e = np.arange(10, 30, 5)
print('e:', e)

# Using np.linspace()
f = np.linspace(0, 2, 9)
print('f:', f)


### Accessing and modifying elements

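Elements in a NumPy array can be accessed and modified using indices and slices, with the same syntax as Python lists. One difference is that a slice of an array is a view of the original data rather than a copy. Here are some examples: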


In [None]:
# Declare numpy array
arr = np.array([2, 4, 6, 8, 10])

#Access elements
print('First element:', arr[0])
print('Second element:', arr[1])
print('Last element:', arr[-1])

#Modify elements
arr[0] = 12
arr[-1] = 0
print('Modified array:', arr)


### Basic Operations

 - Arithmetic operators can be used on arrays to perform element-wise operations.

 - The product operator `*` operates element-wise on NumPy arrays. The matrix product can be computed with the `@` operator (in Python >= 3.5) or with the `dot` function or method.

 - Some operations, such as `+=` and `*=`, act in place to modify an existing array rather than create a new one.

 - NumPy provides familiar mathematical functions such as `sin`, `cos`, and `exp`. In NumPy, these are called "universal functions" (ufuncs). They operate element-wise on an array and produce an array as output.


In [None]:
# Declare arrays
a = np.array([20, 30, 40, 50])
b = np.arange(4)

# Subtraction
print('Subtraction:', a-b)

# Square
print('Square:', b**2)

# Conditional operations
print('Conditional:', a<35)

# Modify existing array
a += b
print('Modified:', a)
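
The cell above does not exercise the matrix product or a ufunc, so here is a minimal sketch of both. The arrays `A` and `B` are made-up values chosen only to make the difference between `*` and `@` visible.

```python
import numpy as np

# Two small illustrative matrices (hypothetical values)
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [1, 0]])

# Element-wise product: multiplies matching entries
elementwise = A * B

# Matrix product: rows of A combined with columns of B
matrix = A @ B
also_matrix = A.dot(B)  # same result via the dot method

# A ufunc applies a function to every element and returns an array
roots = np.sqrt(np.array([1.0, 4.0, 9.0]))

print('Element-wise:', elementwise.tolist())  # [[0, 2], [3, 0]]
print('Matrix product:', matrix.tolist())     # [[2, 1], [4, 3]]
print('Square roots:', roots.tolist())        # [1.0, 2.0, 3.0]
```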


### Numpy functions

Sum, Min, Max, and Cumulative sum:



In [None]:
# Declare arrays
a = np.array([20, 30, 40, 50])

# Sum
print('Sum:', a.sum())

# Min
print('Min:', a.min())

# Max
print('Max:', a.max())

# Cumulative sum
print('Cumulative sum:', a.cumsum())


## Pandas

Pandas is one of the most popular Python libraries for data analysis. Much of its performance-critical code is written in Cython or C, with the rest in Python.

We can analyze data in pandas with:

- Series
- DataFrames

A Series is a one-dimensional (1-D) labeled array defined in pandas that can store any data type.

A DataFrame is a two-dimensional (2-D) data structure defined in pandas that consists of rows and columns.

### Series

Create a Series by passing a list of values; pandas assigns a default integer index.


In [None]:
import pandas as pd
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)


### DataFrame

A DataFrame is a two-dimensional table with labeled axes. In general, you can think of it as a dictionary of Series objects that share an index, and it's the most commonly used pandas object.

Here are some ways you can create a DataFrame.


In [None]:

# Creating a DataFrame by passing a numpy array, with datetime index and labeled columns:
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df)

# Creating a DataFrame by passing a dict of objects that can be converted to series-like.
age = [22, 25, 27, 29, 31]
name = ['Henry', 'Micheal', 'Charles', 'James', 'Andrew']
df = pd.DataFrame({'age': age, 'name': name})
print(df)


## Data Viewing and Selection

Pandas provides various methods to have a look at the data and select it.


In [None]:

## Viewing Data

# Here is how to view the top and bottom rows of the frame:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print('Head:')
print(df.head())

print('\nTail:')
print(df.tail(3))

# Display the index, columns:
print('\nIndex:')
print(df.index)

print('\nColumns:')
print(df.columns)

# DataFrame.to_numpy() gives a NumPy representation of the underlying data.
# Note that this can be an expensive operation when your DataFrame has columns with different data types
print('\nTo NumPy:')
print(df.to_numpy())

# describe() shows a quick statistic summary of your data:
print('\nDescribe:')
print(df.describe())

## Selection

# Selecting a single column, which yields a Series, equivalent to df.A:
print('\nColumn A:')
print(df['A'])

# Selecting via [], which slices the rows.
print('\nSlice:')
print(df[1:3])

# For getting a cross section using a label:
print('\nCross section:')
print(df.loc[dates[0]])

# Selecting on a multi-axis by label:
print('\nMulti-axis selection:')
print(df.loc[:, ['A', 'B']])

# For getting values with a boolean array:
print('\nBoolean array:')
print(df[df > 0])



## Missing Data

Pandas uses the value `NaN` (`np.nan`) to represent missing data. By default, missing values are skipped in computations such as `sum()` and `mean()`. Here are some ways to handle missing data.


In [None]:

## Missing data

# Each call below returns a new object; df itself is left unchanged.

# You can drop any rows that have missing data:
print(df.dropna(how="any"))

# You can fill missing data:
print(df.fillna(value=5))

# To get the boolean mask where values are NaN:
print(pd.isna(df))
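
The `df` used in the cell above contains no missing values, so the calls have nothing to act on. As a hedged illustration, the sketch below uses a small made-up frame named `demo` (not part of the original notebook) that does contain `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing values
demo = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                     'B': [4.0, 5.0, np.nan]})

dropped = demo.dropna(how='any')  # keeps only rows without any NaN
filled = demo.fillna(value=0)     # replaces every NaN with 0
mask = demo.isna()                # True where a value is missing

# Reductions such as mean() skip NaN by default
col_means = demo.mean()

print(dropped)
print(filled)
print(mask)
print(col_means)  # A is 2.0, B is 4.5
```

None of these calls modifies `demo`; each returns a new object.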


## Operations

Pandas provides various methods to perform operations on dataframe.


In [None]:

# Performing descriptive statistics:
print(df.mean())

# Operating with objects that have different dimensionality requires alignment.
# pandas aligns on the index and broadcasts along the specified dimension.
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
print(df.sub(s, axis='index'))

# Applying functions to the data:
print(df.apply(np.cumsum))
print(df.apply(lambda x: x.max() - x.min()))

# Histogramming:
s = pd.Series(np.random.randint(0, 7, size=10))
print(s.value_counts())


## Merge

Pandas provides various ways to combine Series and DataFrame objects, with set logic for the indexes and relational algebra functionality for join / merge-type operations.
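
As a minimal sketch of these combining operations, the example below uses `pd.concat` to stack frames and `pd.merge` to join them on a shared column. The frames `left` and `right` and their column names are invented for illustration.

```python
import pandas as pd

# Two hypothetical frames sharing a 'key' column
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'b', 'd'], 'rval': [4, 5, 6]})

# Concatenation stacks the rows of several frames
stacked = pd.concat([left, left], ignore_index=True)

# An inner join (the default) keeps only keys present in both frames
inner = pd.merge(left, right, on='key')

# An outer join keeps every key and fills the gaps with NaN
outer = pd.merge(left, right, on='key', how='outer')

print(stacked)
print(inner)
print(outer)
```

Here `inner` has rows for keys `a` and `b` only, while `outer` has four rows, with `NaN` wherever one side lacks the key.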



## Grouping

By "group by" we are referring to a process involving one or more of the following steps:

- **Splitting** the data into groups based on some criteria
- **Applying** a function to each group independently
- **Combining** the results into a data structure


In [None]:

# Grouping and then applying the sum() function to the resulting groups.
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})

print(df.groupby('A').sum())

# Grouping by multiple columns forms a hierarchical index; the function is then applied to each group.
print(df.groupby(['A','B']).sum())
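
To make the split, apply, and combine steps visible, the sketch below performs them by hand and compares the result with a single `groupby` call. The `sales` frame is a made-up example with fixed values so the totals can be checked by eye.

```python
import pandas as pd

# Hypothetical data with fixed values
sales = pd.DataFrame({'team': ['x', 'y', 'x', 'y'],
                      'score': [1, 2, 3, 4]})

# Split: one sub-frame per distinct value of 'team'
groups = {name: part for name, part in sales.groupby('team')}

# Apply: reduce each group independently
totals = {name: part['score'].sum() for name, part in groups.items()}

# Combine: assemble the per-group results into one Series
manual = pd.Series(totals, name='score')

# groupby performs the same three steps in a single call
builtin = sales.groupby('team')['score'].sum()

print(manual)   # x is 4, y is 6
print(builtin)
```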


## Reshaping

There are several ways to reshape and pivot tables in pandas.


In [None]:

# Using the stack() method on the DataFrame:
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]))

index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]

print(df2.stack())

# With the pivot_table() method you can reshape the data:
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['A', 'B', 'C'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})

print(df)
print(df.pivot_table(values='D', index=['A', 'B'], columns=['C']))
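
Because the cell above uses random numbers, the reshaped output is hard to verify by eye. The following sketch repeats the two ideas on small fixed frames; `wide` and `records` are hypothetical names and values used only for illustration.

```python
import pandas as pd

# stack() moves the column labels into an inner index level;
# unstack() reverses the operation.
wide = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['r1', 'r2'])
long = wide.stack()    # Series indexed by (row label, column label)
back = long.unstack()  # columns 'A' and 'B' restored

# pivot_table() spreads one column's values across new columns.
# Duplicate index/column pairs are aggregated with the mean by default.
records = pd.DataFrame({'city': ['P', 'P', 'Q', 'Q'],
                        'year': [2020, 2021, 2020, 2021],
                        'value': [10, 20, 30, 40]})
table = records.pivot_table(values='value', index='city', columns='year')

print(long)
print(back)
print(table)
```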


## Data Visualization

Pandas objects provide a `plot()` method, a convenience wrapper around Matplotlib, for creating charts.


In [None]:

import matplotlib.pyplot as plt  # needed for the plt.* calls below

# For example, let's create a simple plot:
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2022', periods=1000))
ts = ts.cumsum()
ts.plot()

# On DataFrame, the plot() method is a convenience to plot all of the columns with labels:
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df = df.cumsum()

df.plot()  # creates its own figure, so a separate plt.figure() call is not needed
plt.legend(loc='best')

## Working with Text Data

Pandas provides a set of string methods, available through the `.str` accessor, that make it easy to operate on string data. Most importantly, these methods skip missing/NaN values and leave them as NaN in the result.

Let's create a Series with text data and apply some string functions to it:


In [None]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

# lowercasing the strings
print(s.str.lower())

# uppercasing
print(s.str.upper())

# length of the strings
print(s.str.len())

# splitting text
print(s.str.split())
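
To see how missing values behave with these string methods, here is a minimal sketch on a three-element Series. The name `words` is made up for this example, and `na=False` is an optional argument of `str.contains` that treats missing entries as non-matches.

```python
import numpy as np
import pandas as pd

words = pd.Series(['Aaba', np.nan, 'dog'])

lowered = words.str.lower()  # the missing entry stays missing
lengths = words.str.len()    # 4 and 3 for the strings, NaN for the gap

# With na=False, the missing entry is reported as False
has_a = words.str.contains('a', na=False)

print(lowered)
print(lengths)
print(has_a.tolist())  # [True, False, False]
```

Passing `na=False` is useful when the result will be used as a boolean filter, since a mask that contains `NaN` cannot be used for indexing.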



## Exploratory Data Analysis (EDA)

Exploratory Data Analysis is the process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.

Pandas provides many methods to quickly explore your data. Let's look at some basic statistical details, such as percentiles, mean, and standard deviation, of the numerical values of a Series or DataFrame:


In [None]:

df = pd.DataFrame(np.random.randn(8, 5), columns=['A', 'B', 'C', 'D', 'E'])

# provides a statistical summary for all numeric columns
print(df.describe())

# you can also extract one particular metric if you want
print(df.mean())  # mean of each column
print(df.std())   # standard deviation of each column
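
The text above mentions percentiles, which `describe()` reports at 25%, 50%, and 75% by default. As a hedged sketch, the example below requests other percentiles and reads a single quantile directly. The `scores` frame is a made-up example with fixed values.

```python
import pandas as pd

scores = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                       'y': [10, 20, 30, 40, 50]})

# Ask describe() for the 10th and 90th percentiles
summary = scores.describe(percentiles=[0.1, 0.9])

# quantile() returns one quantile per column, or a scalar for a Series
median = scores.quantile(0.5)
q90 = scores['x'].quantile(0.9)  # linear interpolation between 4 and 5

print(summary)
print(median)
print(q90)
```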


## Control Structures

Control structures are constructs that decide which statements run, and how often, based on the values of variables. The two basic types of control structure relevant here are conditionals and loops.

- **Conditionals** (if-else statements): These are used to perform different actions based on different conditions. Here is an example of a simple if-else statement:
```python
x = 10
if x < 10:
    print("x is less than 10")
elif x == 10:
    print("x is exactly 10")
else:
    print("x is greater than 10")
```
- **Loops**: Python has two types of loops: `for` loops and `while` loops.

    - A `for` loop is used for iterating over a sequence (that is either a list, a tuple, a dictionary, a set, or a string) or other iterable objects. Iterating over a sequence is called traversal. Here is an example of a simple for loop:

    ```python
    for i in range(5):
        print(i)
    ```

    - A `while` loop repeatedly executes its body as long as a given condition is true. Here is an example of a simple while loop:

    ```python
    i = 0
    while i < 5:
        print(i)
        i += 1
    ```


In [None]:

# Additional examples of string operations

# replacing a substring
print(s.str.replace('a', 'b'))

# check if strings contain a pattern or regular expression
print(s.str.contains('A'))

# count occurrence of a pattern
print(s.str.count('a'))

# get the leading (left) and trailing (right) whitespaces removed
s2 = pd.Series(['  A', 'B  ', '  C  '])
print(s2.str.lstrip())  # leading spaces are removed
print(s2.str.rstrip())  # trailing spaces are removed
print(s2.str.strip())   # both leading and trailing spaces are removed

In [None]:

# Additional EDA examples

# variance of the values
print(df.var())

# correlation between columns
print(df.corr())

# cumulative sum
print(df.cumsum())

# histogram of each numeric column (requires matplotlib); hist() draws the plots itself
df.hist()

In [None]:

## Additional Control Structures Examples

# Example of a if-elif-else chain
x = 7
if x < 10:
    print("x is less than 10")
elif x < 15:
    print("x is less than 15 but more than or equal to 10")
else:
    print("x is greater than or equal to 15")

# Example of a for loop over a list with an else clause.
# The else block is skipped here because the loop exits via break.
for i in [1, 2, 3, 4, 5]:
    if i == 3:
        break
    print(i)
else:
    print("The loop has finished iterating over the sequence.")

# Example of a while loop with an else clause.
# The else block runs here because the loop ends without a break.
x = 0
while x < 5:
    print(x)
    x += 1
else:
    print("x is no longer less than 5")
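
A loop's `else` clause runs only when the loop finishes without hitting `break`, which makes it a natural fit for searches. The sketch below wraps that pattern in a hypothetical helper, `first_negative`, written only for this illustration.

```python
def first_negative(values):
    """Return the first negative number in values, or None if there is none."""
    for v in values:
        if v < 0:
            found = v
            break         # leaving via break skips the else clause
    else:
        found = None      # runs only when the loop was never broken
    return found


print(first_negative([3, -1, -2]))  # -1
print(first_negative([1, 2]))       # None
```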


## Detailed Explanations

1. **Pandas Operations**: With pandas, you can perform a variety of operations on your data, such as arithmetic (addition, subtraction, etc.), conditional operations, statistical operations (mean, median, etc.), string operations, and more complex transformations. Both Series and DataFrames support these operations.

2. **Exploratory Data Analysis (EDA)**: Pandas provides several ways to quickly understand and explore your data. These include the methods `describe` (which provides a quick statistical summary of your data), `head` (which shows the first N rows of your data), and `tail` (which shows the last N rows), as well as the `shape` attribute (which gives the number of rows and columns in your data), and many more.

3. **Python Control Structures**: In Python, you have control structures that direct the flow of your program. You have conditional statements (`if-elif-else`) that run certain blocks of code based on specific conditions. Then you also have loops (`for` and `while`) that run a block of code multiple times. The `for` loop often iterates over a sequence (like a list or a string) whereas a `while` loop runs as long as a certain condition is met.



## Detailed Walkthrough of Examples

### Pandas Operations Example Walkthrough:

This walkthrough uses a pandas Series named 's' with more varied text than the Series in the earlier cell:

```python
s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','SteveSmith'])
```
which consists of several string (text) values, including names, a string with a special character '@', a number written as a string '1234', and a missing value represented as `np.nan`.

#### Operations:

1. `s.str.lower()`: This operation converts all the string values in 's' to lower case.

2. `s.str.upper()`: This operation converts all the string values in 's' to upper case.

3. `s.str.len()`: This operation calculates the length of each string value in 's' (the number of characters) and returns the lengths as a Series of numbers.

These operations help us manipulate and understand the string data in a pandas Series or DataFrame. In each case the missing value stays `NaN` in the result. If this were a real dataset, we could use these operations to standardize text values, derive new characteristics from our data, and clean it for further analysis.
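
As one possible way to use these results together, the sketch below collects the lower-cased text and the string lengths into a small DataFrame. The frame `features` and its column names are assumptions made for this example, not part of the original notebook.

```python
import numpy as np
import pandas as pd

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234', 'SteveSmith'])

# Standardize the text and derive a length feature in one frame
features = pd.DataFrame({'name_clean': s.str.lower(),
                         'name_length': s.str.len()})

upper = s.str.upper()

print(features)
print(upper[3])  # ALBER@T
```

The row for the missing value holds `NaN` in both columns, so it can be dropped or filled in a later cleaning step.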




### Additional String Operations Examples Walkthrough:

These notable examples focus on more advanced string manipulation techniques.

1. `s.str.replace('a', 'b')`: This operation replaces all occurrences of 'a' in the strings within the Series 's' with 'b'. This could be used when you need to replace specific characters or correct spelling errors across your dataset.

2. `s.str.contains('A')`: This operation checks each string in the Series 's' to see if it contains the pattern 'A' (treated as a regular expression by default, and case-sensitive). It returns a Series with True or False for each string, and a missing value where the original entry was missing. This is particularly useful when you're looking to filter out or examine rows that contain a certain keyword or pattern.

3. `s.str.count('a')`: This operation counts the number of times 'a' appears in each string in the Series 's'. The result is a Series of numbers representing these counts.

Next, we create a new Series 's2':

```python
s2 = pd.Series(['  A', 'B  ', '  C  '])
```

which contains string values with leading and/or trailing whitespaces.

1. `s2.str.lstrip()`: This operation removes leading spaces from each string in the Series 's2'.

2. `s2.str.rstrip()`: This operation removes trailing spaces from each string in the Series 's2'.

3. `s2.str.strip()`: This operation removes both leading and trailing spaces from each string in the Series 's2'.

These operations are useful in data cleaning processes where you need to remove unneeded whitespace from your text data.
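
The operations in this walkthrough are often chained in a cleaning step. Here is a minimal sketch under that assumption, using a made-up Series named `raw`. `regex=False` is passed to `str.replace` so that the pattern is treated as a literal string.

```python
import pandas as pd

raw = pd.Series(['  Apple', 'banana  ', '  Avocado  ', 'cherry'])

# Remove surrounding whitespace first
clean = raw.str.strip()

# Keep only the entries that contain a capital 'A'
with_capital_a = clean[clean.str.contains('A')]

# Count lowercase 'a' in each entry
a_counts = clean.str.count('a')

# Replace every literal 'a' with 'b'
swapped = clean.str.replace('a', 'b', regex=False)

print(clean.tolist())           # ['Apple', 'banana', 'Avocado', 'cherry']
print(with_capital_a.tolist())  # ['Apple', 'Avocado']
print(a_counts.tolist())        # [0, 3, 1, 0]
print(swapped.tolist())         # ['Apple', 'bbnbnb', 'Avocbdo', 'cherry']
```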



### Exploratory Data Analysis (EDA) Examples Walkthrough:

In the EDA examples, we assume we have a pandas DataFrame 'df'. We demonstrate various operations to explore and understand the data contained in 'df'.

1. `df.var()`: This operation calculates the variance of the values in each column in the DataFrame 'df'. Variance is a measure of how spread out the values are. By default pandas computes the sample variance (dividing by N - 1).

2. `df.corr()`: This operation calculates the correlation between each pair of numerical columns in the DataFrame 'df'. By default this is the Pearson correlation coefficient, which ranges from -1 (perfect negative linear relationship) to 1 (perfect positive linear relationship).

3. `df.cumsum()`: This operation applies a cumulative sum over each column in the DataFrame 'df'. A cumulative sum is the sequence of partial sums of a given sequence.

4. `df.hist()`: This operation plots a histogram for each numerical column in the DataFrame 'df'. A histogram groups the values into bins and shows how many values fall into each bin.

These EDA operations provide a quick and effective understanding of the underlying data which is essential before carrying out any further data analysis or data processing tasks.
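
With random data, these statistics are hard to check by hand. The sketch below repeats three of the operations on a small frame whose values are chosen so the results are easy to verify. The name `df_small` and its contents are assumptions made for illustration.

```python
import pandas as pd

# y is exactly 2 * x, and z decreases as x increases
df_small = pd.DataFrame({'x': [1, 2, 3, 4],
                         'y': [2, 4, 6, 8],
                         'z': [4, 3, 2, 1]})

variances = df_small.var()  # sample variance of each column
corr = df_small.corr()      # pairwise Pearson correlation
running = df_small.cumsum() # partial sums down each column

print(variances)
print(corr)
print(running['x'].tolist())  # [1, 3, 6, 10]
```

Because `y` is a positive multiple of `x`, their correlation is 1, while `x` and `z` have a correlation of -1.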



### Python Control Structures Example Walkthrough:

#### Conditional Statements:

In Python, conditional statements are used to execute certain pieces of code based on specific conditions. You primarily use `if`, `elif` (short for 'else if'), and `else` for this purpose.

```python
day = 'Sunday'

if day == 'Sunday':
    print('Today is the day of the sun.')
elif day == 'Monday':
    print('Today is the day of the moon.')
else:
    print('Today is a weekday.')
```
In this example, we have a variable 'day' that contains the string 'Sunday'. The `if` statement checks if the value of 'day' is 'Sunday'. If it is, 'Today is the day of the sun.' is printed. The `elif` statement checks if the value of 'day' is 'Monday'. If it is, 'Today is the day of the moon.' is printed. If 'day' is neither 'Sunday' nor 'Monday', the `else` statement executes and 'Today is a weekday.' is printed.

#### Loops:

Python provides `for` and `while` loops to execute a block of code repeatedly.

##### Example of `for` loop:
```python
numbers = [1, 2, 3, 4, 5]

for number in numbers:
    print(number)
```
In this example, we have a list of numbers from 1 to 5. The `for` loop iterates over each item in this list, and the `print` statement inside the loop prints the current item.

##### Example of `while` loop:
```python
i = 0

while i < 5:
    print(i)
    i += 1
```
In this example, we define a variable 'i' with an initial value of 0. The `while` loop keeps running as long as 'i' is less than 5. Within the loop, we first print the current value of 'i', then increment 'i' by 1. Once 'i' becomes 5, the `while` loop stops because the condition 'i < 5' is no longer True.
