### Data Exploration with Pandas

#### What is Data Exploration?

- Data exploration is the initial step in data analysis, where you understand the structure, quality, and main characteristics of the data.
- It involves summarizing the main features of the data and identifying patterns, relationships, and potential issues.
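
As a compact illustration of the points above, the sketch below builds a one-table overview of a DataFrame. The `toy` data and the `overview` name are invented for this example and are not part of the Titanic dataset; treat it as one possible first look, not a fixed recipe.

```python
import pandas as pd

# Toy data invented for illustration; it is not the Titanic dataset.
toy = pd.DataFrame({
    "age": [22.0, None, 35.0, 7.0],
    "sex": ["male", "female", "female", "male"],
    "fare": [7.25, 71.28, 8.05, 21.08],
})

# One row per column: its dtype, missing-value count, and distinct-value count.
overview = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_missing": toy.isna().sum(),
    "n_unique": toy.nunique(),  # nunique() ignores missing values by default
})
print(overview)
```

The same three calls (`dtypes`, `isna().sum()`, `nunique()`) work on any DataFrame, so the pattern carries over to the real data loaded next.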

In [None]:
# Import necessary libraries
import pandas as pd
import seaborn as sns

In [None]:
# Load the Titanic dataset from the seaborn library
titanic = sns.load_dataset('titanic')

In [None]:
# A list of all the datasets available in Seaborn
sns.get_dataset_names()

In [None]:
# 1. Head and Tail
# Display the first few rows of the DataFrame
titanic.head()

In [None]:
# Display the last few rows of the DataFrame
titanic.tail()

In [None]:
# 2. Info
# Get a concise summary of the DataFrame
titanic.info()

In [None]:
# 3. Describe
# Generate descriptive statistics
titanic.describe()

In [None]:
# 4. Shape
# Get the number of rows and columns
titanic.shape

In [None]:
# 5. Columns
# Get the column names
titanic.columns

In [None]:
# 6. Isnull and Sum
# Detect missing values and sum them up
titanic.isnull().sum()

In [None]:
# 7. Value_counts
# Get the count of unique values in a column
titanic['sex'].value_counts()

In [None]:
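# 8. Fill missing values
# Assign the filled column back; calling fillna(..., inplace=True) on a column selection
# is chained assignment and may not update the DataFrame in recent pandas versions
titanic['age'] = titanic['age'].fillna(titanic['age'].median())
# 'S' (Southampton) is the most common port of embarkation in this dataset
titanic['embarked'] = titanic['embarked'].fillna('S')
# Confirm that 'age' and 'embarked' no longer contain missing values
titanic.isnull().sum()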
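
The fill step can also compute the most common value instead of hard-coding it. The following is a minimal sketch on invented toy data (the `toy`, `age_median`, and `embarked_mode` names are assumptions made for this example): median for a numeric column, mode for a categorical one.

```python
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({
    "age": [22.0, None, 35.0, 7.0, None],
    "embarked": ["S", "C", None, "S", "Q"],
})

# Median for a numeric column: less sensitive to outliers than the mean.
age_median = toy["age"].median()
# Mode for a categorical column: mode() returns a Series, so take its first entry.
embarked_mode = toy["embarked"].mode()[0]

# Assign the results back to the DataFrame instead of using inplace=True.
toy["age"] = toy["age"].fillna(age_median)
toy["embarked"] = toy["embarked"].fillna(embarked_mode)

print(toy.isna().sum())
```

Computing the fill value from the data keeps the cell correct if the dataset changes, at the cost of hiding which value was actually used; printing it once is a reasonable compromise.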

In [None]:
# Filter rows with a query expression: passengers aged exactly 7
titanic.query('age == 7')

In [None]:
# Filter rows with a boolean mask: male passengers only
titanic[titanic['sex'] == 'male']

In [None]:
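# Use a string method inside query(); only 'female' contains the letter "f"
# engine='python' ensures the .str accessor is evaluated by pandas itself
titanic.query('sex.str.contains("f")', engine='python')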
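
The two filtering styles used above are interchangeable for simple conditions. This sketch, on invented toy data, shows the same filter written both ways and how `query()` can reference a Python variable with `@`. The names `toy`, `max_age`, and `children` are hypothetical.

```python
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({
    "age": [22.0, 7.0, 35.0, 7.0],
    "sex": ["male", "female", "female", "male"],
})

# Boolean-mask style: combine conditions with & and wrap each in parentheses.
mask_result = toy[(toy["age"] == 7) & (toy["sex"] == "female")]

# query() style: the same filter written as a string expression.
query_result = toy.query("age == 7 and sex == 'female'")

# Inside query(), prefix a Python variable with @ to use its value.
max_age = 10
children = toy.query("age <= @max_age")

print(mask_result.equals(query_result))
print(children)
```

Which style to prefer is mostly a readability choice: masks are explicit and composable, while `query()` tends to read better when several conditions are combined.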

In [None]:
# Practice Questions:
# 1. Load a dataset of your choice and use the head() method to display the first five rows.
# 2. Use the info() method to get a summary of the dataset.
# 3. Use the describe() method to get descriptive statistics of the dataset.
# 4. Find the number of missing values in each column using isnull() and sum().

### Data Transformation: Why?

- Data transformation is the process of converting data from one format or structure into another.
- It is crucial for preparing the data for analysis, ensuring consistency, and improving the performance of machine learning models.

#### Categorical to Numerical

- Converting categorical variables to numerical values is essential for many machine learning algorithms that require numerical input.

#### Label Encoder Function

- Label Encoding is a technique for converting categorical values to numerical values by assigning a unique integer to each category.
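
To make the idea concrete, here is a pandas-only sketch of label encoding done by hand. It mimics the sorted-category ordering that scikit-learn's `LabelEncoder` uses, but it is an illustration rather than that class's implementation; the names `to_code` and `to_category` are invented for this example.

```python
import pandas as pd

# Toy column invented for illustration.
sex = pd.Series(["male", "female", "female", "male"], name="sex")

# Sort the unique categories and give each one its position as an integer code.
categories = sorted(sex.unique())
to_code = {category: code for code, category in enumerate(categories)}
encoded = sex.map(to_code)

# Keeping the inverse mapping makes it possible to recover the original labels.
to_category = {code: category for category, code in to_code.items()}
decoded = encoded.map(to_category)

print(to_code)            # {'female': 0, 'male': 1}
print(encoded.tolist())   # [1, 0, 0, 1]
```

Because the codes follow alphabetical order, they carry no meaning beyond identity; that limitation is what motivates one-hot encoding for unordered categories.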

In [None]:
# Preview the 'sex' column before encoding
titanic[['sex']].head()

In [None]:
# Example: Label Encoding
# Import the LabelEncoder class from sklearn's preprocessing module
from sklearn.preprocessing import LabelEncoder

# Create an instance of LabelEncoder
label_encoder = LabelEncoder()

# Apply the label encoder to the 'sex' column of the titanic DataFrame
# Categories are numbered in sorted order, so 'female' becomes 0 and 'male' becomes 1
titanic['sex'] = label_encoder.fit_transform(titanic['sex'])

# Display the first 5 rows of the transformed 'sex' column to verify the encoding
titanic[['sex']].head()

#### Label Encoder Dictionary

- A dictionary gives you explicit control over the encoding: you choose the integer for each category and apply the mapping with `map()`.

In [None]:
# Preview the 'embarked' column before encoding
titanic[['embarked']].tail()

In [None]:
# Example: Label Encoding with a dictionary

# Creating a dictionary to map the 'embarked' column values to numerical labels
# 'C' -> 0, 'Q' -> 1, 'S' -> 2
embarked_mapping = {'C': 0, 'Q': 1, 'S': 2}

# Using the map function to transform the 'embarked' column in the titanic DataFrame
# This replaces the categorical text data ('C', 'Q', 'S') with the corresponding numerical labels
titanic['embarked'] = titanic['embarked'].map(embarked_mapping)

# Printing the last 5 rows of the transformed 'embarked' column to verify the encoding
titanic[['embarked']].tail()
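
One thing to watch with dictionary mapping: `map()` returns `NaN` for any value that is not a key in the dictionary. The sketch below, on an invented toy Series, shows a simple check for values the mapping missed. The `unmapped` name and the lowercase `'s'` entry are assumptions made for illustration.

```python
import pandas as pd

# Toy column invented for illustration: a lowercase 's' and a missing value.
embarked = pd.Series(["S", "C", "Q", "s", None])
embarked_mapping = {"C": 0, "Q": 1, "S": 2}

# Values without a matching key (including missing values) become NaN.
encoded = embarked.map(embarked_mapping)

# Values that were present in the data but absent from the dictionary.
unmapped = embarked[encoded.isna() & embarked.notna()]

print(encoded.tolist())
print(unmapped.tolist())  # ['s']
```

Running a check like this right after `map()` catches typos and unexpected categories before they silently turn into missing values.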

In [None]:
# Practice Questions:
# 1. Create a dictionary to encode the 'class' column of the Titanic dataset.
# 2. Apply the dictionary to the 'class' column using the map() function.

### One Hot Encoding

- One-Hot Encoding is a technique that converts categorical variables into binary (0 or 1) columns for each category.
- This avoids a pitfall of label encoding: integer codes imply an order, so an algorithm may treat one category as greater than another even when no such order exists.
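
A small sketch of what one-hot encoding produces, using an invented three-category column. `dtype=int` is passed so the new columns hold 0 and 1 (recent pandas versions otherwise return booleans), and `drop_first=True` is shown as an optional variant that drops one redundant column.

```python
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({"who": ["man", "woman", "child", "man"]})

# One binary column per category, named <column>_<category> in sorted order.
one_hot = pd.get_dummies(toy, columns=["who"], dtype=int)

# Optional: drop the first category; it is implied when all other columns are 0.
reduced = pd.get_dummies(toy, columns=["who"], dtype=int, drop_first=True)

print(one_hot)
print(list(reduced.columns))  # ['who_man', 'who_woman']
```

Each row of `one_hot` contains exactly one 1, which is where the name comes from. Dropping the first column is mainly useful for linear models, where the full set of columns is perfectly collinear.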

In [None]:
# List the distinct categories in the 'who' column
titanic['who'].unique()

In [None]:
# Display first 10 rows
titanic.head(10)

In [None]:
# Example: One-Hot Encoding

# Apply one-hot encoding to the 'who' column in the titanic DataFrame
# This creates a separate binary column for each unique value in 'who' ('who_child', 'who_man', 'who_woman')
# Each binary column holds 1 if the row had that value and 0 otherwise
# dtype=int keeps the new columns as 0/1; recent pandas versions return True/False by default
titanic = pd.get_dummies(titanic, columns=['who'], dtype=int)

# Display the first 10 rows of the DataFrame to verify the one-hot encoding
titanic.head(10)

In [None]:
# Practice Questions:
# 1. Reload the Titanic dataset and display basic information using info().
# 2. Fill missing values in the 'age' column with the median age and 'embarked' column with the most common value.
# 3. Use LabelEncoder to encode the 'sex' column.
# 4. Create a dictionary to encode the 'embarked' column and apply it.
# 5. Use pd.get_dummies() to one-hot encode the 'class' column.

### More on the `loc` and `iloc` Indexers

#### Basic Syntax
`DataFrame.loc[row_labels, column_labels]` selects by label; `DataFrame.iloc[row_positions, column_positions]` selects by integer position.
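
The difference between labels and positions is easiest to see when the index is not the default 0, 1, 2, ... range. The sketch below uses an invented toy DataFrame with index labels 10, 20, 30; all names in it are hypothetical.

```python
import pandas as pd

# Toy data invented for illustration, with a non-default index.
toy = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cid"], "age": [30, 7, 45]},
    index=[10, 20, 30],
)

# loc looks up the row labelled 20 and the column labelled 'name'.
by_label = toy.loc[20, "name"]

# iloc looks up the second row and the first column by position.
by_position = toy.iloc[1, 0]

# There is no row labelled 0, so loc raises KeyError even though a first row exists.
label_zero_exists = True
try:
    toy.loc[0]
except KeyError:
    label_zero_exists = False

print(by_label, by_position, label_zero_exists)  # Bob Bob False
```

With the default integer index, labels and positions happen to coincide, which is why `loc` and `iloc` can look interchangeable in the cells that follow.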

In [None]:
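import pandas as pd
# Load the Titanic training data from a local CSV file
# (assumes train.csv is in the working directory)
titanic_df = pd.read_csv('train.csv')
titanic_df.head()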

#### 1. Selecting Rows by Label

You can use `loc` to select specific rows based on their index labels.

In [None]:
# Select a range of rows by label
# loc slices include both endpoints, so this returns the rows labelled 2 through 13
titanic_df.loc[2:13]
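
Slicing is one place where `loc` and `iloc` behave differently even on a default index. A minimal sketch on an invented six-row DataFrame:

```python
import pandas as pd

# Toy data invented for illustration; the default index runs from 0 to 5.
toy = pd.DataFrame({"value": list("abcdef")})

# Label-based slice: both endpoints are included (labels 1, 2, 3).
label_slice = toy.loc[1:3]

# Position-based slice: the end is excluded, as in Python lists (positions 1, 2).
position_slice = toy.iloc[1:3]

print(len(label_slice), len(position_slice))  # 3 2
```

The inclusive end makes sense for labels because they need not be integers: with string labels there is no natural "one past the end" value.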

In [None]:
# Select rows based on a boolean condition
titanic_df.loc[titanic_df['Cabin'] == 'C148']

In [None]:
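# For comparison, select a single row by integer position with iloc
# The double brackets return a one-row DataFrame instead of a Series
titanic_df.iloc[[889]]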

#### 2. Selecting Columns by Label

You can also use `loc` to select specific columns based on their labels.

In [None]:
# Select a single column by label (passing a list returns a DataFrame, not a Series)
titanic_df.loc[:, ['Cabin']]

In [None]:
# Select multiple columns by label
titanic_df.loc[:, ['Name', 'Cabin']]

In [None]:
# Select columns based on a boolean condition
titanic_df.loc[:, titanic_df.columns.str.contains('Se')]

In [None]:
# The boolean mask used above: one True/False value per column name
titanic_df.columns.str.contains('Se')

#### 3. Selecting Rows and Columns Simultaneously

`loc` allows you to specify both rows and columns in a single operation:

In [None]:
# Select specific rows and columns
titanic_df.loc[[0, 10], ['Name', 'Cabin']]

In [None]:
# Select specific range of rows and columns
titanic_df.loc[0:10, ['Name', 'Cabin']]

In [None]:
# Select rows based on a condition and specific columns
titanic_df.loc[titanic_df['Age'] == 7, ['Name', 'Age', 'Cabin']]

#### 4. Assigning Values with `loc`

You can assign values to specific rows and columns using `loc`:

In [None]:
titanic_df.head(3)

In [None]:
# Assign a value to a specific cell
titanic_df.loc[0, 'Cabin'] = 'C149'

In [None]:
# Confirm the new value
titanic_df.loc[0, 'Cabin']

In [None]:
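# Inspect the slice that will be overwritten (labels 12 through 20, inclusive)
titanic_df.loc[12:20, 'Cabin']

In [None]:
# Count the rows in the slice; a list assigned to it must have this many values
len(titanic_df.loc[12:20, 'Cabin'])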

In [None]:
# Assign values to a slice of the DataFrame (one value per row in the slice)
titanic_df.loc[12:20, 'Cabin'] = ['Hufflepuff', 'Ravenclaw', 'Gryffindor',
                                  'Slytherin', 'Ravenclaw', 'Gryffindor',
                                  'Hufflepuff', 'Hufflepuff', 'Hufflepuff']

In [None]:
titanic_df.loc[12:20, 'Cabin']
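
Assignment with `loc` also works with a boolean condition, which is a common way to set values for a subset of rows. The sketch below uses invented toy data and a hypothetical `group` column; it also shows that assigning a list of the wrong length is rejected.

```python
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({
    "age": [30.0, 7.0, 45.0, 2.0],
    "group": ["unknown"] * 4,
})

# Set 'group' only for the rows where the condition holds.
toy.loc[toy["age"] < 18, "group"] = "child"
toy.loc[toy["age"] >= 18, "group"] = "adult"

# A list must have exactly one value per selected row; three values for two rows fails.
mismatch_rejected = False
try:
    toy.loc[0:1, "group"] = ["a", "b", "c"]
except ValueError:
    mismatch_rejected = True

print(toy)
print(mismatch_rejected)
```

A single scalar is broadcast to every selected row, while a list is matched row by row, which is why checking the slice length before assigning a list is a useful habit.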

---
_**Your Dataness**_,  
`Obinna Oliseneku` (_**Hybraid**_)  
**[LinkedIn](https://www.linkedin.com/in/obinnao/)** | **[GitHub](https://github.com/hybraid6)**  