{{< include _include_d2.qmd >}}

```{python}
#| eval: true
#| echo: false
#| output: false

import pandas as pd
df = pd.read_csv('/home/sol-nhl/rnd/d/cca-cce/csv/iris.tsv', sep='\t')
```

## Import data files

To read a CSV file into a Pandas DataFrame, first install the Pandas library if you haven't already, for example with `pip install pandas`. Then use the `read_csv()` function, which takes the file path as its argument and returns a DataFrame containing the data. Here's how you can read a file named "iris.csv" into a Pandas DataFrame:

```python
import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv('/path/to/your/iris.csv')

# Display the first few rows of the DataFrame
print(df.head())
```

Replace `'/path/to/your/iris.csv'` with the actual path where your file is located. This will give you a DataFrame `df` that contains all the data from "iris.csv".
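`read_csv()` also handles other delimited formats through its `sep` parameter. The sketch below is a minimal illustration, assuming a tab-separated file; the inline sample text and its values are made up for the example and stand in for a real file on disk.

```python
import io

import pandas as pd

# Hypothetical tab-separated sample standing in for a file on disk
tsv_text = (
    "sepal_length\tsepal_width\tspecies\n"
    "5.1\t3.5\tsetosa\n"
    "7.0\t3.2\tversicolor\n"
)

# sep="\t" tells read_csv that columns are separated by tabs
tsv_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Check the shape and the inferred column types after loading
print(tsv_df.shape)
print(tsv_df.dtypes)
```

With a real file, pass its path instead of the `io.StringIO` object and keep the `sep` argument.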

## Selecting and filtering

In Pandas, two basic yet powerful operations are selecting specific columns and filtering rows. To select columns, you can use the syntax `df[['column1', 'column2']]`, which creates a new DataFrame containing only the selected columns. For example:

```{python}
#| eval: true
#| echo: true
#| output: true

# To select 'sepal_length' and 'species' columns
selected_columns = df[['sepal_length', 'species']]
# Show the first few rows of the resulting DataFrame
selected_columns.head()
```

To filter rows based on a condition, use boolean indexing. The expression `df['column'] > value` produces a Series of `True`/`False` values, one per row, and `df[df['column'] > value]` keeps only the rows where that value is `True`. For instance:

```{python}
#| eval: true
#| echo: true
#| output: true

# To filter rows where 'sepal_length' is greater than 5
filtered_rows = df[df['sepal_length'] > 5]
# Show the first few rows of the resulting DataFrame
filtered_rows.head()
```

Both operations return new DataFrames, which can then be used for further analysis.
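Selecting and filtering can also be combined in a single step. The following is a minimal sketch, assuming a small made-up DataFrame named `sample` that reuses the column names from above; it is an illustration, not the actual iris data.

```python
import pandas as pd

# Hypothetical sample rows reusing the iris column names
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.3],
    "petal_length": [1.4, 1.4, 4.7, 6.0],
    "species": ["setosa", "setosa", "versicolor", "virginica"],
})

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
mask = (sample["sepal_length"] > 5) & (sample["species"] != "setosa")

# .loc filters rows with the mask and selects columns at the same time
subset = sample.loc[mask, ["sepal_length", "species"]]
print(subset)
```

The parentheses matter because `&` binds more tightly than comparison operators such as `>` in Python.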

## Grouping and summarizing

Grouping and summarizing data in Pandas is primarily done with the `groupby()` method. It groups rows by the values in one or more columns, after which you can apply aggregation methods such as `mean()`, `sum()`, or `count()` to summarize each group. For instance, to find the average measurements for each species in the `df` DataFrame, group by the 'species' column and then apply `mean()` to get the average of each numerical column.

Here's a code example:

```{python}
#| eval: true
#| echo: true
#| output: true

# Group by 'species' and calculate the mean for each numerical column
grouped_by_species_mean = df.groupby('species').mean()
# Show the first few rows of the resulting DataFrame
grouped_by_species_mean.head()
```

This will give you a new DataFrame that contains the summarized data, facilitating easier comparisons between different groups.
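Several summaries can be computed in one pass with `agg()`. The sketch below assumes a small made-up DataFrame named `sample`, with values chosen only for illustration, and summarizes a single column per group.

```python
import pandas as pd

# Hypothetical sample rows reusing the iris column names
sample = pd.DataFrame({
    "sepal_length": [5.0, 5.2, 6.0, 6.4],
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
})

# Select one column after grouping, then apply several aggregations at once
summary = sample.groupby("species")["sepal_length"].agg(["mean", "min", "max", "count"])
print(summary)
```

The result has one row per group and one column per aggregation, indexed by the grouping column.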

## Statistical analysis

Regression analysis explores how a dependent variable changes with one or more independent variables. In Python, scikit-learn is a popular library for performing regression, and its `LinearRegression` class fits an ordinary least squares model. After separating your features from the target variable, you fit the model with the `fit()` method and make predictions with `predict()`. For example, to predict 'petal_length' from 'sepal_length' in the `df` DataFrame:

```{python}
#| eval: true
#| echo: true
#| output: true

from sklearn.linear_model import LinearRegression
import pandas as pd

# Prepare the features and target variable
X = df[['sepal_length']]  # Feature (independent variable)
y = df['petal_length']  # Target (dependent variable)

# Create a LinearRegression object
model = LinearRegression()

# Fit the model to the data
model.fit(X, y)

# Make predictions
predictions = model.predict(X)

# Create a DataFrame to display the analysis result
result_df = pd.DataFrame({'Actual': y, 'Predicted': predictions})
# Show the first few rows of the result DataFrame
result_df.head()
```

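This will create a DataFrame `result_df` that contains both the actual and predicted 'petal_length' values side by side. Because the predictions are made on the same rows used to fit the model, the comparison shows how closely the line fits this data rather than how well it would predict new observations.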
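Beyond comparing rows by eye, a fitted `LinearRegression` exposes its slope and intercept, and its `score()` method returns the coefficient of determination (R²). The sketch below assumes a small made-up DataFrame named `sample` whose points lie exactly on the line `petal_length = 2 * sepal_length - 8`, so the recovered numbers are easy to check; real data will not fit this cleanly.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical sample lying exactly on petal_length = 2 * sepal_length - 8
sample = pd.DataFrame({
    "sepal_length": [5.0, 5.5, 6.0, 6.5, 7.0],
    "petal_length": [2.0, 3.0, 4.0, 5.0, 6.0],
})

X_sample = sample[["sepal_length"]]  # Feature matrix (2-D)
y_sample = sample["petal_length"]    # Target vector (1-D)

fitted = LinearRegression().fit(X_sample, y_sample)

slope = fitted.coef_[0]                  # Change in target per unit of the feature
intercept = fitted.intercept_            # Predicted target when the feature is 0
r2 = fitted.score(X_sample, y_sample)    # R² on the data used for fitting

print(f"slope={slope:.3f}, intercept={intercept:.3f}, R²={r2:.3f}")
```

An R² close to 1 means the line explains most of the variation in the target on this data; a value near 0 means it explains very little.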
