{{< include _include_d2.qmd >}}

In [None]:
#| eval: true
#| echo: false
#| output: false

import pandas as pd

#df = pd.read_csv('/home/sol-nhl/rnd/d/cca-cce/csv/iris.tsv', sep='\t')
df = pd.read_csv('https://raw.githubusercontent.com/nils-holmberg/cca-cce/main/csv/iris.csv', sep='\t')

## Import data files

To read a CSV file into a Pandas DataFrame, first install the Pandas library if it is not already available, for example with `pip install pandas`. The `read_csv()` function then loads the data: it takes the file path (or a URL) as its first argument and returns a DataFrame containing the data. Here's how you can read a file named "iris.csv" into a Pandas DataFrame:

```python
import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv('/path/to/your/iris.csv')

# Display the first few rows of the DataFrame
print(df.head())
```

Replace `'/path/to/your/iris.csv'` with the actual path where your file is located. This will give you a DataFrame `df` that contains all the data from "iris.csv".
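
By default `read_csv()` splits columns on commas, so tab-separated files need the `sep` argument. The sketch below shows this with an in-memory string in place of a real file; the three sample rows are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical tab-separated content standing in for a file on disk
tsv_text = "sepal_length\tspecies\n5.1\tsetosa\n6.4\tvirginica\n5.7\tversicolor\n"

# sep='\t' tells read_csv to split columns on tabs instead of commas
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep='\t')

print(df_tsv.shape)   # (rows, columns)
print(df_tsv.dtypes)  # column types inferred by pandas
```

Checking `shape` and `dtypes` right after loading is a quick way to spot a wrong separator, which typically shows up as a single wide column.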

## Selecting and filtering

In Pandas, two basic yet powerful operations are selecting specific columns and filtering rows. To select columns, you can use the syntax `df[['column1', 'column2']]`, which creates a new DataFrame containing only the selected columns. For example:

In [None]:
#| eval: true
#| echo: true
#| output: true

# To select 'sepal_length' and 'species' columns
selected_columns = df[['sepal_length', 'species']]
# Show the first few rows of the resulting DataFrame
selected_columns.head()

To filter rows based on a condition, boolean indexing can be employed. The syntax `df[df['column'] > value]` filters rows where the values in the specified column meet the condition. For instance:

In [None]:
#| eval: true
#| echo: true
#| output: true

# To filter rows where 'sepal_length' is greater than 5
filtered_rows = df[df['sepal_length'] > 5]
# Show the first few rows of the resulting DataFrame
filtered_rows.head()

Both operations return new DataFrames, which can then be used for further analysis.
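
Selection and filtering can also be combined in one step. The following is a minimal sketch using `.loc` with two conditions; the sample rows are made up and only mimic the column names of the iris data.

```python
import pandas as pd

# Made-up sample rows with the same column names as the iris data
sample = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 5.8, 6.4, 5.7],
    'sepal_width': [3.5, 3.0, 2.7, 3.2, 3.0],
    'species': ['setosa', 'setosa', 'virginica', 'virginica', 'versicolor'],
})

# Each condition needs its own parentheses; & means "and", | means "or"
mask = (sample['sepal_length'] > 5) & (sample['species'] == 'virginica')

# .loc takes the row condition first, then the list of columns to keep
subset = sample.loc[mask, ['sepal_length', 'species']]
print(subset)
```

Storing the condition in a variable such as `mask` keeps longer filters readable and makes it easy to reuse the same condition elsewhere.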

## Grouping and summarizing

Grouping and summarizing data in Pandas is primarily achieved using the `groupby()` function. This function allows you to group rows based on one or multiple columns, and then you can apply aggregation methods like `mean()`, `sum()`, or `count()` to summarize the data. For instance, if you want to find the average measurements for each species in the `df` DataFrame, you can group by the 'species' column and then apply the `mean()` function to get the average for each numerical column.

Here's a code example:

In [None]:
#| eval: true
#| echo: true
#| output: true

# Group by 'species' and calculate the mean for each numerical column
# (numeric_only=True skips any non-numeric columns instead of raising an error)
grouped_by_species_mean = df.groupby('species').mean(numeric_only=True)
# Show the first few rows of the resulting DataFrame
grouped_by_species_mean.head()

This will give you a new DataFrame that contains the summarized data, facilitating easier comparisons between different groups.
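
Grouping by more than one column works the same way: pass a list of column names to `groupby()`. The sketch below is illustrative only; the rows are made up and the `site` column is hypothetical, not part of the iris data.

```python
import pandas as pd

# Made-up rows; the 'site' column is hypothetical and not part of the iris data
sample = pd.DataFrame({
    'species': ['setosa', 'setosa', 'setosa', 'virginica', 'virginica', 'virginica'],
    'site': ['A', 'A', 'B', 'A', 'B', 'B'],
    'petal_length': [1.4, 1.3, 1.5, 6.0, 5.1, 5.9],
})

# Passing a list of columns creates one group per unique combination
summary = sample.groupby(['species', 'site'])['petal_length'].agg(['count', 'sum', 'mean'])
print(summary)

# reset_index() turns the group labels back into ordinary columns
print(summary.reset_index())
```

The result is indexed by both grouping columns; `reset_index()` is often convenient when the summary is to be filtered or written to a file afterwards.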

Let's try another example: analyzing sepal length in the Iris dataset with Pandas. The Iris dataset, a classic in data science, contains sepal and petal measurements for three iris species. After loading it into a DataFrame with `pd.read_csv()`, you can use the `groupby()` method to group the rows by species and then apply `mean()` and `std()` to compute the mean and standard deviation of `sepal_length` for each species. The code example below demonstrates the process on a small hand-made sample with the same structure as the dataset:

In [None]:
#| eval: true
#| echo: true
#| output: true

import pandas as pd

# Sample data mimicking the Iris dataset structure
data = {
    'sepal_length': [5.1, 4.9, 5.8, 6.4, 5.7],
    'sepal_width': [3.5, 3.0, 2.7, 3.2, 3.0],
    'species': ['setosa', 'setosa', 'virginica', 'virginica', 'versicolor']
}
df_data = pd.DataFrame(data)

# Group by species and calculate mean and standard deviation for sepal_length
means = df_data.groupby('species')['sepal_length'].mean()
std_devs = df_data.groupby('species')['sepal_length'].std()
# Calculate both mean and std for sepal_length grouped by species in one call
stats = df_data.groupby('species')['sepal_length'].agg(['mean', 'std'])

print("Mean Sepal Length by Species:")
print(means)
print("\nStandard Deviation of Sepal Length by Species:")
print(std_devs)
print("\nMean and Standard Deviation Combined:")
print(stats)

Executing the above code prints the mean and standard deviation of `sepal_length` for each species in the sample data. Because 'versicolor' has only one row in this sample, its standard deviation is reported as `NaN`.

## Statistical analysis

Regression analysis is used to explore the relationship between dependent and independent variables. In Python, Scikit-learn is a popular library for performing regression. You typically use the `LinearRegression` class to create a regression model. After separating your features and target variables, you can fit the model using the `fit()` method and make predictions with `predict()`. For example, if you want to predict 'petal_length' based on 'sepal_length' in the `df` DataFrame:

In [None]:
#| eval: true
#| echo: true
#| output: true

from sklearn.linear_model import LinearRegression
import pandas as pd

# Prepare the features and target variable
X = df[['sepal_length']]  # Feature (independent variable)
y = df['petal_length']  # Target (dependent variable)

# Create a LinearRegression object
model = LinearRegression()

# Fit the model to the data
model.fit(X, y)

# Make predictions
predictions = model.predict(X)

# Create a DataFrame to display the analysis result
result_df = pd.DataFrame({'Actual': y, 'Predicted': predictions})
# Show the first few rows of the result DataFrame
result_df.head()

This will create a DataFrame `result_df` that contains both the actual and predicted 'petal_length', facilitating the evaluation of the model's performance.
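
One way to turn such a table into a single performance figure is to compute the root mean squared error (RMSE) and R squared from the two columns. The sketch below does this by hand on made-up numbers arranged like `result_df`; the values are illustrative and not taken from the iris data.

```python
import numpy as np
import pandas as pd

# Made-up actual and predicted values in the same layout as result_df
example_df = pd.DataFrame({
    'Actual': [1.4, 4.5, 5.1, 6.0],
    'Predicted': [1.6, 4.2, 5.3, 5.8],
})

# Residuals are the differences between observed and predicted values
residuals = example_df['Actual'] - example_df['Predicted']

# Root mean squared error: typical size of a prediction error
rmse = np.sqrt((residuals ** 2).mean())

# R squared: share of the variance in 'Actual' explained by the predictions
ss_res = (residuals ** 2).sum()
ss_tot = ((example_df['Actual'] - example_df['Actual'].mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot

print(f"RMSE: {rmse:.3f}")
print(f"R squared: {r_squared:.3f}")
```

For a fitted `LinearRegression` model, `model.score(X, y)` returns R squared computed in the same way, so the manual calculation is mainly useful for seeing what the number means.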

## Write data files

In [None]:
#| eval: true
#| echo: true
#| output: true

# Write the DataFrame to a tab-separated text file, leaving out the row index
df.to_csv("../../tmp/some.tsv", sep='\t', index=False)
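
A simple way to check that a file was written as intended is to read it back. The following is a minimal sketch of such a round trip, assuming a temporary directory and a made-up two-row DataFrame; the file name `sample.tsv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Made-up rows standing in for the DataFrame to be saved
sample = pd.DataFrame({
    'sepal_length': [5.1, 6.4],
    'species': ['setosa', 'virginica'],
})

# Write into a temporary directory so no existing file is overwritten
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'sample.tsv')

    # index=False leaves out the row labels, which would otherwise
    # be written as an extra unnamed first column
    sample.to_csv(out_path, sep='\t', index=False)

    # Read the file back with the same separator to check the round trip
    restored = pd.read_csv(out_path, sep='\t')

print(restored)
```

Using the same `sep` value for writing and reading is what keeps the columns aligned on the way back in.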

## Try it yourself!

In [None]:
#| eval: true
#| echo: true
#| output: true

import pandas as pd

# Load the Palmer penguins data (tab-separated) into its own DataFrame
penguins_df = pd.read_csv('https://raw.githubusercontent.com/nils-holmberg/cca-cce/main/csv/palmerpenguins.tsv', sep='\t')

![palmer penguins](https://nils-holmberg.github.io/cca-cce/res/img/lter_penguins.png){#fig-penguins height=214px width=360px}

**Tasks:**

1. Display the first 5 rows of the dataset to get a quick overview.
2. Show the summary statistics (mean, standard deviation, min, max, etc.) for numerical columns.
3. Determine the number of unique species in the dataset.
4. Filter the dataset to show only Adelie penguins from Torgersen island.
5. Calculate the average bill length of male penguins across all species.
6. Find out the year with the highest recorded average body mass for penguins.
7. Determine the number of missing values in each column.
8. Display all records for penguins with a flipper length greater than 210 mm.
9. Group the data by species and calculate the average bill depth for each group.
10. Count the number of penguins on each island.

In [None]:
#| eval: false
#| echo: false
#| output: false

# Task 1
overview = penguins_df.head()

# Task 2
summary_statistics = penguins_df.describe()

# Task 3
unique_species = penguins_df['species'].nunique()

# Task 4
adelie_torgersen = penguins_df[(penguins_df['species'] == 'Adelie') & (penguins_df['island'] == 'Torgersen')]

# Task 5
average_bill_length_male = penguins_df[penguins_df['sex'] == 'male']['bill_length_mm'].mean()

# Task 6
year_highest_body_mass = penguins_df.groupby('year')['body_mass_g'].mean().idxmax()

# Task 7
missing_values = penguins_df.isnull().sum()

# Task 8
large_flippers = penguins_df[penguins_df['flipper_length_mm'] > 210]

# Task 9
avg_bill_depth_by_species = penguins_df.groupby('species')['bill_depth_mm'].mean()

# Task 10
penguins_by_island = penguins_df['island'].value_counts()

These tasks and their solutions cover the basic data analysis operations introduced above: inspecting, filtering, grouping, and summarizing data with Pandas.
