# Pandas Learning Plan: Introduction to the pandas Library
### Valerie's Pandas Resources and

https://www.coursera.org/learn/data-analysis-with-python
https://www.coursera.org/learn/python-for-data-visualization

Official documentation and tutorials:
https://pandas.pydata.org/docs/user_guide/10min.html#min
https://pandas.pydata.org/docs/getting_started/index.html
https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html

Cheat Sheets:
https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

https://www.datacamp.com/cheat-sheet/pandas-cheat-sheet-for-data-science-in-python
https://www.geeksforgeeks.org/data-wrangling-in-python
https://www.geeksforgeeks.org/pandas-cheat-sheet/
https://www.webpages.uidaho.edu/~stevel/cheatsheets/Pandas%20DataFrame%20Notes_12pages.pdf


Wrangling:
https://www.geeksforgeeks.org/how-to-utilise-pandas-dataframe-and-series-for-data-wrangling/

Pandas GUI ??
https://www.geeksforgeeks.org/data-exploration-using-pandas-gui/


Pandas Definitions and Concepts
**Class Definition:**
A class is like a blueprint for creating objects. In pandas, one of the core classes is the `DataFrame`, which is a two-dimensional, table-like data structure with labeled axes (rows and columns).

**Function Definition:**
A function is a reusable block of code that performs a specific task. In pandas, functions are used to manipulate and analyze data stored in DataFrames.

**Method Definition:**
A method is a function that's associated with an object and operates on that object's data. In pandas, DataFrame methods are used to perform various operations on the data within the DataFrame.

**Object Definition:**
An object is an instance of a class. In pandas, a DataFrame object is created to store and manipulate tabular data.
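
The four definitions above can be seen side by side in a few lines. This is a minimal sketch with made-up sample data; the variable names are chosen only for illustration.

```python
import pandas as pd

# pd.DataFrame is a class (the blueprint).
# Calling it creates an object (an instance of that class).
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
print(type(df))

# pd.concat is a function: it is called from the pandas namespace,
# and the objects it works on are passed in as arguments.
combined = pd.concat([df, df], ignore_index=True)

# head() is a method: it is called on the object itself.
first_row = df.head(1)

print(len(combined))   # 4
print(len(first_row))  # 1
```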

**Learning Steps:**

**Step 1: Installing and Importing pandas**
Explain how to install pandas using tools like `pip`. Show how to import the library into a Python script or Jupyter Notebook.

***Note:*** What do we need to do before we **import** pandas?
```python
# Import pandas
import pandas as pd
```
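
pandas has to be installed before it can be imported. The sketch below assumes `pip` is available on the machine; the install command is shown as a comment because it is run in a terminal, not in Python. Printing the version is a quick way to confirm that the import worked.

```python
# Run once in a terminal before importing (assumes pip is available):
#     pip install pandas

import pandas as pd

# If the import succeeds, pandas is installed; print the version to confirm.
print(pd.__version__)
```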

**Step 2: Creating DataFrames**
Show how to create a DataFrame using various data sources like dictionaries, CSV files, and Excel files. Explain how rows and columns are labeled.

```python
# Create a DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 22]}
df = pd.DataFrame(data)
```
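
The dictionary example covers only one data source. As a rough sketch of the CSV case, the block below reads CSV text held in memory (the `csv_text` content is hypothetical) so that it runs without a file on disk; with a real file, the path would be passed to `pd.read_csv()` instead. It also shows where row and column labels come from. `pd.read_excel()` is used in a similar way, but it needs an extra Excel engine such as `openpyxl` installed, so it is not run here.

```python
import io
import pandas as pd

# Hypothetical CSV content held in memory so the example needs no file on disk.
csv_text = "Name,Age\nAlice,25\nBob,30\nCharlie,22\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# Column labels come from the header row; row labels default to 0, 1, 2, ...
print(list(df_csv.columns))  # ['Name', 'Age']
print(list(df_csv.index))    # [0, 1, 2]

# Row labels can also be set explicitly when building a DataFrame.
df_labeled = pd.DataFrame({'Age': [25, 30, 22]},
                          index=['Alice', 'Bob', 'Charlie'])
print(df_labeled.loc['Bob', 'Age'])  # 30
```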

**Step 3: Accessing Data**
Teach how to access specific data using column labels, row indexes, and conditional filtering.

```python
# Access a column
names = df['Name']

# Access a row by its index label (use df.iloc[0] to select by position)
row = df.loc[0]

# Conditional filtering
young_people = df[df['Age'] < 30]
```
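
With the default index, labels and positions happen to be the same numbers, which can hide the difference between `loc` and `iloc`. The sketch below uses letter labels (chosen only for illustration) to make the difference visible, and shows how to combine two filter conditions.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 22]},
                  index=['a', 'b', 'c'])  # non-default labels, for illustration

# loc selects by label, iloc selects by integer position.
by_label = df.loc['b']
by_position = df.iloc[1]
print(by_label['Name'], by_position['Name'])  # Bob Bob

# Combine conditions with & (and) or | (or); wrap each condition in parentheses.
mask = (df['Age'] > 21) & (df['Age'] < 30)
print(df.loc[mask, 'Name'].tolist())  # ['Alice', 'Charlie']
```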

**Step 4: Data Manipulation**
Explain how to add, delete, and modify columns in a DataFrame.

```python
# Add a new column
df['Gender'] = ['Female', 'Male', 'Male']

# Delete a column
df = df.drop(columns=['Gender'])

# Modify values
df.loc[0, 'Age'] = 26
```
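
A few more manipulation patterns come up often: deriving a column from an existing one, renaming a column, and deleting a row. The sketch below is one way to do each; the new column names are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 22]})

# Add a column computed from an existing one.
df['AgeNextYear'] = df['Age'] + 1

# Rename a column; rename() returns a new DataFrame.
df = df.rename(columns={'AgeNextYear': 'Age_Next_Year'})

# Delete a row by its index label.
df = df.drop(index=1)

print(df['Age_Next_Year'].tolist())  # [26, 23]
print(list(df.index))                # [0, 2]
```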

**Step 5: Data Analysis**
Introduce basic data analysis tasks using pandas functions and methods.

```python
# Calculate mean age
average_age = df['Age'].mean()

# Count occurrences of each age
age_counts = df['Age'].value_counts()

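# Group data and calculate statistics
# (re-add 'Gender' first, because Step 4 dropped that column)
df['Gender'] = ['Female', 'Male', 'Male']
grouped = df.groupby('Gender')
avg_age_by_gender = grouped['Age'].mean()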
```
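
Several statistics can be computed per group in one call with `agg()`. The sketch below rebuilds the sample data with a `Gender` column so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [26, 30, 22],
                   'Gender': ['Female', 'Male', 'Male']})

# One row per group, one column per statistic.
summary = df.groupby('Gender')['Age'].agg(['mean', 'min', 'max', 'count'])
print(summary)

print(summary.loc['Male', 'mean'])   # 26.0
print(summary.loc['Male', 'count'])  # 2
```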

**Step 6: Data Visualization**
Demonstrate how to create simple visualizations using pandas and libraries like `matplotlib`.

```python
# Import matplotlib for visualization
import matplotlib.pyplot as plt

# Plot a bar chart of age counts
age_counts.plot(kind='bar')
plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Age Distribution')
plt.show()
```

**Use Cases:**
1. **Data Cleaning:** Pandas is used to clean and preprocess messy data before analysis.
2. **Data Exploration:** It's used to quickly explore data, calculate basic statistics, and identify patterns.
3. **Data Analysis:** Pandas enables data manipulation, aggregation, and grouping for in-depth analysis.
4. **Data Visualization:** It's used to create basic visualizations for insights.
5. **Data Transformation:** Pandas helps transform data for machine learning and other downstream tasks.

Remember, pandas is a powerful library, and these steps will provide a solid foundation to start working with tabular data. As you progress, you can explore more advanced topics like merging data, time series analysis, and more specialized functions.

Pandas has two main classes: the `Series` class and the `DataFrame` class.

1. **Series Class:**
   The `Series` class represents a one-dimensional labeled array, similar to a column in a spreadsheet or a single vector of data. Each element in a `Series` has an associated label, referred to as its index. The index allows for easy and efficient data retrieval and alignment.

   Example of creating a `Series`:
   ```python
   import pandas as pd

   data = [10, 20, 30, 40, 50]
   series = pd.Series(data, index=['A', 'B', 'C', 'D', 'E'])
   ```

2. **DataFrame Class:**
   The `DataFrame` class is a two-dimensional labeled data structure, resembling a table with rows and columns. It's the most commonly used data structure in pandas and is suitable for storing and working with tabular data. Each column in a `DataFrame` is a `Series`.

   Example of creating a `DataFrame`:
   ```python
   import pandas as pd

   data = {'Name': ['Alice', 'Bob', 'Charlie'],
           'Age': [25, 30, 22]}
   df = pd.DataFrame(data)
   ```

Besides these core classes, there are a few other classes that are essential for specialized tasks:

3. **Index Class:**
   The `Index` class represents the row and column labels in a `DataFrame`. It's often automatically created and managed by pandas. It enables efficient data retrieval, alignment, and various indexing operations.

   Example of using an `Index`:
   ```python
   import pandas as pd

   data = [10, 20, 30]
   index = pd.Index(['A', 'B', 'C'])
   series = pd.Series(data, index=index)
   ```

4. **DatetimeIndex Class:**
   The `DatetimeIndex` class is a specialized type of index for time series data. It's designed to handle dates and times as index values, allowing for convenient time-based operations and resampling.

   Example of using a `DatetimeIndex`:
   ```python
   import pandas as pd

   dates = pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03'])
   data = [10, 20, 30]
   series = pd.Series(data, index=dates)
   ```

These classes provide the building blocks for working with data in pandas. The `Series` class represents individual columns or vectors of data, while the `DataFrame` class combines multiple `Series` objects into a tabular structure. The other classes, such as `Index` and `DatetimeIndex`, enhance the capabilities of pandas for specialized use cases.
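
How these classes relate to each other can be checked directly. The sketch below uses small made-up data to show that a DataFrame column is a `Series`, that arithmetic aligns on index labels rather than on position, and that a `DatetimeIndex` supports resampling.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 22]})

# Selecting one column of a DataFrame gives back a Series.
ages = df['Age']
print(isinstance(ages, pd.Series))  # True

# Arithmetic between Series aligns on index labels, not on position.
s1 = pd.Series([1, 2, 3], index=['A', 'B', 'C'])
s2 = pd.Series([10, 20, 30], index=['C', 'B', 'A'])
total = s1 + s2
print(total['A'], total['B'], total['C'])  # 31 22 13

# A DatetimeIndex allows lookups by timestamp and resampling to a new frequency.
dates = pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03'])
ts = pd.Series([10, 20, 30], index=dates)
print(ts.loc[pd.Timestamp('2023-01-02')])  # 20
print(ts.resample('2D').sum().tolist())    # [30, 30]
```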

## Pandas Modules

Pandas is organized into modules and top-level functions that provide different functionality for working with data.

1. **pandas.core**: This subpackage implements the fundamental data structures of pandas, primarily the `Series` and `DataFrame` classes. It is considered internal, so these classes are normally accessed as `pd.Series` and `pd.DataFrame`.

2. **pandas.io**: This subpackage implements reading and writing data in various formats, such as CSV, Excel, JSON, SQL databases, and more. The reader functions `read_csv()`, `read_excel()`, and `read_sql()` are exposed at the top level, for example `pd.read_csv()`.

3. **pandas.plotting**: This module contains helper functions for specialized plots, such as `scatter_matrix()` and `andrews_curves()`. The everyday `plot()` and `hist()` calls used for quick data exploration are methods of DataFrames and Series rather than functions in this module.

4. **DataFrame.groupby()**: `groupby()` is a method of DataFrames and Series, not a separate `pandas.groupby` module. It performs grouping and aggregation, which is essential for summarizing and analyzing data based on different categories.

5. **pandas.merge()**: This is a top-level function, not a module. It combines DataFrames based on common columns by performing database-style joins, and it is also available as the `DataFrame.merge()` method.

6. **resample()**: This is a method of Series and DataFrames used for time series analysis. It changes the frequency of time series data and typically works on a datetime-like index.

7. **pandas.crosstab()**: This top-level function creates cross-tabulations, which are summary tables showing the relationship between two or more categorical variables.

8. **pandas.pivot_table()**: This top-level function, also available as the `DataFrame.pivot_table()` method, creates spreadsheet-like pivot tables to summarize and analyze data.

9. **pandas.tools**: This module existed in early versions of pandas. It was deprecated and is no longer available in current releases, so use the equivalent top-level functions and methods instead.

10. **pandas.api**: This public subpackage groups supporting functions and classes, such as the dtype-checking helpers in `pandas.api.types` and the extension interfaces in `pandas.api.extensions`. Most everyday work still goes through the top-level `pandas` namespace.

11. **pandas.util**: This module provides utility functions and classes for internal use within pandas. Users typically won't directly interact with this module.

Remember, while pandas provides a variety of modules for different tasks, the core functionalities are primarily accessed through the `pandas` module itself. As you learn more about pandas, you'll become more familiar with which modules to use for different tasks.
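
In practice, most of this functionality is reached from the top-level `pd` namespace. The sketch below uses made-up sample data (the `cities` table is hypothetical) to show `pd.merge()`, `pd.crosstab()`, and `pd.pivot_table()`, plus one helper from the public `pandas.api.types` module.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                       'Gender': ['Female', 'Male', 'Male'],
                       'Age': [25, 30, 22]})
# Hypothetical second table, used only to demonstrate a join.
cities = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                       'City': ['Paris', 'Oslo', 'Oslo']})

# pd.merge: database-style join on a shared column.
merged = pd.merge(people, cities, on='Name')

# pd.crosstab: counts for each combination of two categorical columns.
counts = pd.crosstab(merged['Gender'], merged['City'])

# pd.pivot_table: spreadsheet-style summary, here the mean age per city.
pivot = pd.pivot_table(merged, values='Age', index='City', aggfunc='mean')

# pandas.api.types: public helpers for checking dtypes.
print(is_numeric_dtype(merged['Age']))  # True
print(counts.loc['Male', 'Oslo'])       # 2
print(pivot.loc['Oslo', 'Age'])         # 26.0
```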

Pandas provides many methods that you can use to manipulate, analyze, and transform data stored in DataFrames and Series.

**DataFrame Methods:**

1. **head() / tail():** Display the first / last n rows of the DataFrame.
2. **info():** Provide information about the DataFrame, including data types and memory usage.
3. **describe():** Generate summary statistics for numerical columns.
4. **shape:** Get the dimensions (number of rows and columns) of the DataFrame. This is an attribute, so it is written without parentheses.
5. **columns:** Access the column labels (attribute).
6. **index:** Access the row labels (attribute).
7. **loc[] / iloc[]:** Access rows and columns by label / integer position (indexers, used with square brackets).
8. **at[] / iat[]:** Fast access to single values by label / integer position (indexers).
9. **drop():** Remove specified rows or columns from the DataFrame.
10. **groupby():** Group data based on specified columns and perform aggregation.
11. **pivot_table():** Create a spreadsheet-like pivot table.
12. **sort_values():** Sort rows based on one or more columns.
13. **fillna():** Fill missing values with a specified value or strategy.
14. **apply():** Apply a function to each row or column.
15. **merge() / join():** Combine DataFrames based on shared columns (`merge()`) or, by default, on the index (`join()`).
16. **to_csv() / to_excel():** Write the DataFrame to a CSV or Excel file.
17. **plot():** Create basic visualizations using matplotlib.
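
A few of the DataFrame methods listed above can be tried together. This is a minimal sketch with made-up data that includes one missing age, so that `fillna()` has something to fill.

```python
import pandas as pd

# None becomes a missing value (NaN) in the numeric column.
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
                   'Age': [25.0, 30.0, None, 22.0]})

print(df.head(2))     # first two rows
print(df.describe())  # summary statistics for the numeric column

# Fill the missing age with the column mean, then sort by age (oldest first).
filled = df.fillna({'Age': df['Age'].mean()})
ordered = filled.sort_values('Age', ascending=False)

# apply() with axis=1 calls the function once per row.
labels = ordered.apply(lambda row: f"{row['Name']} ({row['Age']:.0f})", axis=1)
print(labels.tolist())  # ['Bob (30)', 'Charlie (26)', 'Alice (25)', 'Dana (22)']
```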

**Series Methods:**

1. **head() / tail():** Display the first / last n elements of the Series.
2. **describe():** Generate summary statistics for numerical data.
3. **index:** Access the index labels (attribute, written without parentheses).
4. **values:** Access the underlying data as a NumPy array (attribute).
5. **unique():** Return unique values in the Series.
6. **nunique():** Count the number of unique values.
7. **sort_values():** Sort the Series values.
8. **value_counts():** Count occurrences of each unique value.
9. **map():** Apply a function to each element.
10. **apply():** Apply a function to each element, similar to map().
11. **isnull() / notnull():** Check for missing / non-missing values.
12. **fillna():** Fill missing values with a specified value or strategy.
13. **astype():** Convert the data type of the Series.
14. **str methods:** Perform string operations on string-type Series through the `.str` accessor, for example `series.str.upper()`.
15. **plot():** Create basic visualizations using matplotlib.
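
The Series methods above can be sketched in the same way. The sample values are made up, with one missing entry to show how missing data is counted and filled.

```python
import pandas as pd

colors = pd.Series(['red', 'blue', 'red', None, 'green'])

print(colors.nunique())              # 3 (missing values are not counted)
print(colors.value_counts()['red'])  # 2
print(colors.isnull().sum())         # 1

# Fill the missing value, then use the .str accessor and map().
cleaned = colors.fillna('unknown')
upper = cleaned.str.upper()
lengths = cleaned.map(len)
print(upper.tolist())    # ['RED', 'BLUE', 'RED', 'UNKNOWN', 'GREEN']
print(lengths.tolist())  # [3, 4, 3, 7, 5]

# astype() converts the data type, here from strings to integers.
numbers = pd.Series(['1', '2', '3']).astype(int)
print(numbers.sum())  # 6
```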

These are just a few examples of the many methods provided by pandas. Each method serves a specific purpose in data manipulation, analysis, transformation, and visualization. As you work with pandas, you'll become more familiar with these methods and their various use cases.

Attributes in the pandas library provide information about the structure, properties, and metadata of data stored in DataFrames and Series.

**DataFrame Attributes:**

1. **shape:** A tuple representing the dimensions (number of rows and columns) of the DataFrame.
2. **index:** The index (row labels) of the DataFrame.
3. **columns:** The column labels of the DataFrame.
4. **dtypes:** Data types of each column in the DataFrame.
5. **values:** The underlying data as a 2D NumPy array.
6. **size:** The total number of elements in the DataFrame.
7. **T:** Transpose the DataFrame (swap rows and columns).
8. **empty:** A boolean indicating whether the DataFrame is empty.
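9. **axes:** A list containing the row index and the column index of the DataFrame.
10. **memory_usage():** Memory usage of each column in bytes. This is a method, so it is called with parentheses.
11. **info():** Provide concise summary information about the DataFrame. This is also a method rather than an attribute.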
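
The practical difference is that attributes are read without parentheses, while `memory_usage()` and `info()` are called with parentheses because they are methods. A short sketch with sample data:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 22]})

# Attributes are read without parentheses.
print(df.shape)          # (3, 2)
print(df.size)           # 6
print(list(df.columns))  # ['Name', 'Age']
print(df.empty)          # False
print(df.T.shape)        # (2, 3)
print(df.dtypes)

# Methods are called with parentheses.
print(df.memory_usage())
```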

**Series Attributes:**

1. **index:** The index (row labels) of the Series.
2. **values:** The underlying data as a 1D NumPy array.
3. **dtype:** Data type of the elements in the Series.
4. **size:** The total number of elements in the Series.
5. **name:** The name of the Series.
6. **shape:** A one-element tuple giving the number of elements in the Series, for example `(3,)`.
7. **axes:** A list containing the Series index.
8. **empty:** A boolean indicating whether the Series is empty.

**Index Attributes:**

1. **name:** The name of the index.
2. **dtype:** Data type of the index elements.
3. **nlevels:** The number of levels in the index (1 for a regular `Index`, more for a `MultiIndex`).

These attributes provide insights into the structure of your data, such as the dimensions, labels, data types, and other metadata. Using these attributes can help you understand and work effectively with the data stored in DataFrames and Series.
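
The Series and Index attributes can be inspected the same way. In this sketch the names `score` and `label` are made up for illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30],
              index=pd.Index(['A', 'B', 'C'], name='label'),
              name='score')

print(s.shape)          # (3,)
print(s.size)           # 3
print(s.name)           # score
print(s.dtype)
print(s.index.name)     # label
print(s.index.nlevels)  # 1

# A MultiIndex has more than one level.
multi = pd.MultiIndex.from_tuples([('x', 1), ('x', 2), ('y', 1)])
print(multi.nlevels)    # 2
```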