<a href="https://colab.research.google.com/github/hewp84/CRT420/blob/main/Pandas_Intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PANDAS
## Introduction
Pandas is a popular Python library that provides data structures and functions designed specifically for data analysis workflows. The name "Pandas" comes from "panel data", an econometrics term for multidimensional data sets.


## Data Structures

The two most important data structures in Pandas are:

* Series: A Series is a one-dimensional array-like object that can hold many different data types like integers, strings, booleans, etc. What makes a Series unique is that it has an index which assigns a label to each value. The index makes it easier to access, query, and analyze data in the Series. For example, you can use label-based indexing to slice and dice a Series, similar to how you interact with columns in a spreadsheet.

* DataFrame: A DataFrame is a two-dimensional tabular data structure with labeled rows and columns, akin to a spreadsheet, SQL table, or R data frame. It builds on the Series concept by essentially storing a number of Series objects aligned along an index. This enables complex multivariate, relational data analysis. DataFrames have powerful capabilities like easily handling missing data, merging datasets, pivoting data, and more.

Pandas combines these versatile data structures with numerous functions and methods designed to make many aspects of the typical data analysis workflow fast, efficient, and productive in Python. It excels at tasks like loading, cleaning, transforming, merging, reshaping, and visualizing data.

### Series

* A Pandas Series is a one-dimensional array-like data structure that can hold values of different data types such as integers, floats, strings, booleans, etc.
* It has an index that labels each value, much like row labels in a spreadsheet. This makes Series very useful for data analysis in Python.

The basic method to create a `Series` is to call:

`s = pd.Series(data, index=index)`
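
As a minimal sketch of this constructor, the example below passes an explicit list of labels as `index`. The Series name `temps` and the city labels are made up for illustration.

```python
import pandas as pd

# Hypothetical data: three readings labeled by city
temps = pd.Series([21.5, 18.0, 25.3], index=["Lima", "Oslo", "Cairo"])

print(temps["Oslo"])         # look up one value by its label
print(temps["Lima":"Oslo"])  # label slices include both endpoints
```

Note that a label slice keeps the end label, unlike a positional slice, which stops before the end position.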

#### Creating a Series

In [None]:
import pandas as pd
import numpy as np
# Create from list 
mylist = [1, 2, 3, 4]
myseries1 = pd.Series(mylist)

# Create from NumPy array
arr = np.array([1, 2, 3, 4])
myseries2 = pd.Series(arr)

# Create from dictionary 
mydict = {'a': 1, 'b': 2, 'c': 3} 
myseries3 = pd.Series(mydict)

# Create from scalar value 
myseries4 = pd.Series(5, index=[0, 1, 2, 3])

# Same as above in a single step; mys is reused in the cells below
mys = pd.Series(np.array([1, 2, 3, 4]))

In [None]:
print(myseries1)
print(myseries2)
print(myseries3)
print(myseries4)


#### Series Attributes

In [None]:
print(myseries4.values) # The underlying data as a NumPy array
print(myseries4.index) # The index labels
print(myseries4.dtype) # The data type (e.g. int64, float64, object)
print(myseries4.name) # Name of the Series (None unless one was set)
print(myseries4.shape) # Shape tuple; (4,) means four elements

print(mys.values)

#### Accessing elements

In [None]:
print(myseries3['a']) # Label lookup; myseries3 is the Series with string labels
print(myseries1[1]) # Label 1 of the default integer index
print(mys[2])
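
Plain `[]` can be ambiguous when the index itself holds integers, because the number may be read as a label rather than a position. One common way to be explicit, sketched here with a made-up Series `s`, is to use `.loc` for labels and `.iloc` for positions.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

by_label = s.loc["b"]        # label-based lookup
by_position = s.iloc[1]      # position-based lookup (0-indexed)
subset = s.loc[["a", "c"]]   # several labels at once

print(by_label, by_position)
print(subset)
```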

#### Operations with Series

In [None]:
doubled = myseries1 * 2 # Arithmetic
filtered = myseries1[myseries1 > 2] # Filtering 
sorted1 = myseries1.sort_values() # Sorting
total = myseries1.sum() # Aggregation
add_in = mys[1:] + mys[:-1] # Aligned on index labels, so labels 0 and 3 become NaN

print(mys)
print(add_in)
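
The `mys[1:] + mys[:-1]` line is worth a closer look, because Series arithmetic matches values by index label rather than by position. The sketch below rebuilds `mys` so that it runs on its own, then contrasts the label-aligned sum with a positional sum. Using `to_numpy()` for the positional version is one option among several.

```python
import numpy as np
import pandas as pd

mys = pd.Series(np.array([1, 2, 3, 4]))

# Labels 1, 2, 3 on the left and 0, 1, 2 on the right: only 1 and 2 overlap
aligned = mys[1:] + mys[:-1]

# Dropping to NumPy arrays adds element by element, by position
positional = mys[1:].to_numpy() + mys[:-1].to_numpy()

print(aligned)
print(positional)
```

Labels that appear on only one side have nothing to add to, so they come back as NaN and the result is converted to floats.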

### Dataframes

A DataFrame is a 2-dimensional, tabular data structure with labeled rows and columns similar to a spreadsheet. It can be thought of as a collection of Series objects that share an index. DataFrames are versatile for many types of data analysis in Python because they can store mixed data types (numerals, strings, bools) and provide easy access to data via rows and column labels.
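
To make the "collection of Series sharing an index" idea concrete, here is a small sketch. The column names, row labels, and values are invented for illustration.

```python
import pandas as pd

# Three Series with the same row labels but different data types
names = pd.Series(["Ana", "Ben"], index=["r1", "r2"])
ages = pd.Series([31, 45], index=["r1", "r2"])
members = pd.Series([True, False], index=["r1", "r2"])

people = pd.DataFrame({"name": names, "age": ages, "member": members})

print(people.dtypes)             # each column keeps its own dtype
print(type(people["age"]))       # a single column comes back as a Series
print(people.loc["r2", "name"])  # access by row label and column label
```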

#### Creating Dataframes

The Pandas DataFrame can be created from many different types of data sources:

* Dictionaries of 1D ndarrays, lists, dicts, or Series - The keys become the DataFrame column names
* 2D NumPy ndarray - Used directly as the DataFrame values
* Structured or record ndarray - Used to populate the DataFrame
* A single Series object - Treated as a single column in the DataFrame
* Another DataFrame - Used as the source for the new DataFrame; pass `copy=True` if you need a fully independent copy
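
Structured arrays are probably the least familiar input in this list. A minimal sketch follows, with made-up field names `id` and `value`; the field names become the column names.

```python
import numpy as np
import pandas as pd

# A structured array: each element is a record with named, typed fields
records = np.array(
    [(1, 2.5), (2, 3.5)],
    dtype=[("id", "i4"), ("value", "f8")],
)

df_from_records = pd.DataFrame(records)
print(df_from_records)
```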

When creating a DataFrame, you can optionally specify the index (row labels) and columns (column names). The resulting DataFrame is then guaranteed to contain exactly those labels. For example, passing a dictionary of Series plus a list of column names discards any Series whose key is not in that list.
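
A short sketch of that behavior, using invented keys `a`, `b`, and `c`:

```python
import pandas as pd

data = {
    "a": pd.Series([1, 2], index=["x", "y"]),
    "b": pd.Series([3, 4], index=["x", "y"]),
}

# 'b' is not requested, so it is dropped; 'c' has no data, so it is all NaN
df_subset = pd.DataFrame(data, columns=["a", "c"])
print(df_subset)
```

Requesting a column that has no data behind it does not raise an error; the column is created and filled with NaN.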

If index and column labels are not provided, Pandas will create them automatically based on the input data. For example, dictionaries of ndarrays/lists will use the dict keys as column names. NumPy arrays will get integer indexes and columns.
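
The default labels are easy to check directly. The sketch below builds the same array twice, once with defaults and once with explicit labels; the label names are arbitrary.

```python
import numpy as np
import pandas as pd

values = np.array([[1, 2, 3], [4, 5, 6]])

default_labels = pd.DataFrame(values)
custom_labels = pd.DataFrame(values, index=["r1", "r2"], columns=["x", "y", "z"])

print(default_labels.columns.tolist())  # integer column labels
print(custom_labels)
```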

In [None]:
import pandas as pd 
import numpy as np

# From single Series
data = pd.Series([1, 2, 3, 4]) 
df_from_series = pd.DataFrame(data)

# From list of dicts
data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df_from_list = pd.DataFrame(data) 

# From dict of lists
data = {'col1': [1, 2], 'col2': [3, 4]}
df_from_dict = pd.DataFrame(data)

# From NumPy array
data = np.array([[1, 2, 3], [4, 5, 6]])
df_from_array = pd.DataFrame(data)

# From an external file (assumes sample.csv exists in the working directory)
df_from_file = pd.read_csv('sample.csv')

In [None]:
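# Different ways of displaying DataFrames
print(df_from_dict)  # plain-text rendering
df_from_dict         # as the last expression in a cell, rendered as a formatted table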

#### Basic Operations with Dataframe columns

In [None]:
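# Column selection (df_from_dict has columns 'col1' and 'col2')
df_from_dict['col1']                # single column name -> Series
df_from_dict[['col1', 'col2']]      # list of column names -> DataFrame
df_from_dict.loc[:, 'col1':'col2']  # label slice; both endpoints are included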

In [None]:
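# Assumes sample.csv has 'age', 'gender', and 'weight' columns
a = df_from_file[['gender', 'weight']]
print(a)
b = df_from_file.loc[:, 'age':'gender']  # every column from 'age' through 'gender'
print(b)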

##### Copy Column

In [None]:
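df = pd.DataFrame({'col_to_copy': [1, 2]})  # placeholder data for illustration
df2 = df.copy()                              # independent copy of the whole DataFrame
df2['new_col'] = df['col_to_copy']           # add the column to the copy
# or add the copied column to the original DataFrame
df['new_col'] = df['col_to_copy']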

In [None]:

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}) 

# Copy entire DataFrame first 
df2 = df.copy()
df2['C'] = df['A']  

# Copy just one column 
df['D'] = df['A']


##### Column Deletion

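Use `.drop(columns=names, inplace=True)` to delete columns by name.

* columns (str or list) - Column names to delete
* axis=1 - Operate on columns; only needed with the positional form `df.drop(names, axis=1)` and redundant when `columns=` is given
* inplace=True - Modify the DataFrame in place instead of returning a new one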

In [None]:
# df currently has columns 'A', 'B', and 'D'; remove 'D'
df.drop(columns='D', inplace=True)
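
For comparison, here is a sketch of `drop` without `inplace=True`, using a throwaway DataFrame named `df_demo`. The `errors="ignore"` option is shown as one way to avoid a `KeyError` when a name might not exist.

```python
import pandas as pd

df_demo = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

# Without inplace=True, drop returns a new DataFrame and leaves df_demo untouched
trimmed = df_demo.drop(columns=["B", "C"])

# errors="ignore" skips names that are not present instead of raising KeyError
same = df_demo.drop(columns="missing", errors="ignore")

print(trimmed.columns.tolist())
print(df_demo.columns.tolist())
```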

##### Arithmetic operations with columns

The arithmetic operations are applied element-wise between the columns. Any valid Python arithmetic operator can be used.

The benefit is that these operations are vectorized: they apply to every row of the column at once, without an explicit Python loop over the elements. This makes column arithmetic both fast and concise.
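
A small sketch of the difference, using invented `weight` and `height` values. The loop is included only to show what the single vectorized expression replaces.

```python
import pandas as pd

df_demo = pd.DataFrame({"weight": [60.0, 80.0], "height": [1.5, 2.0]})

# One expression operates on every row at once
ratio = df_demo["weight"] / df_demo["height"]

# The equivalent explicit loop, shown only for comparison
looped = [w / h for w, h in zip(df_demo["weight"], df_demo["height"])]

print(ratio.tolist())
print(looped)
```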

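New columns can also be assigned the result. Operands are aligned on their index labels rather than by position, so a label present in only one operand produces NaN in the result. Assigning a plain list or array whose length differs from the DataFrame raises an error instead.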
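
The two cases behave differently, as the following sketch shows with a throwaway DataFrame `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame({"A": [1, 2, 3]})

# A Series is aligned on index labels; label 2 has no match, so it becomes NaN
df_demo["from_series"] = pd.Series([10, 20], index=[0, 1])

# A plain list has no labels to align on, so a length mismatch raises ValueError
try:
    df_demo["from_list"] = [10, 20]
except ValueError as exc:
    print("ValueError:", exc)

print(df_demo)
```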

In [None]:
df_from_file['density'] = df_from_file['weight']/df_from_file['height']

In [None]:
df_from_file