## Basics of Polars

### Installing Polars

Before diving into examples, you need to install Polars. You can do this using pip:

```pip install polars```

### Creating a DataFrame

Creating a DataFrame in Polars is straightforward. You can create a DataFrame from a dictionary, list of lists, or even from a CSV file.

In [1]:
import polars as pl

# Use a dictionary to define the dataset
# Why? Each key becomes a column name and each list becomes that column's values, so the column layout is explicit.
data = {
    "Name": ["John", "Alice", "Bob"],  # String values
    "Age": [25, 30, 28],  # Integer values (ensures proper numerical operations)
    "Gender": ["Male", "Female", "Male"]  # String values
}

# Create a DataFrame with an explicit schema
# Why? Defining a schema prevents Polars from misinterpreting column types.
df = pl.DataFrame(data, schema={"Name": pl.Utf8, "Age": pl.Int64, "Gender": pl.Utf8})

# Basic data exploration (ensures the DataFrame was created correctly)
print(df)

shape: (3, 3)
┌───────┬─────┬────────┐
│ Name  ┆ Age ┆ Gender │
│ ---   ┆ --- ┆ ---    │
│ str   ┆ i64 ┆ str    │
╞═══════╪═════╪════════╡
│ John  ┆ 25  ┆ Male   │
│ Alice ┆ 30  ┆ Female │
│ Bob   ┆ 28  ┆ Male   │
└───────┴─────┴────────┘


### Basic DataFrame Operations

Polars provides a rich set of functions for data manipulation. Here are some common operations:

#### 1. Filtering and Aggregation
To filter rows based on a condition, use the `filter` method. The filtered result can then be aggregated, for example with `mean`:

In [2]:
# Filtering and aggregation
# Select only "Age" values where "Gender" is "Male"
male_ages = df.filter(pl.col("Gender") == "Male").select("Age")

# Create another DataFrame with consistent schema
# Why? Ensuring schemas match across all DataFrames prevents concatenation errors.
more_data = {
    "Name": ["Charlie", "Diana"],  # String values
    "Age": [22, 26],  # Integer values
    "Gender": ["Male", "Female"]  # String values
}

# Explicit schema definition for the second DataFrame
another_df = pl.DataFrame(more_data, schema={"Name": pl.Utf8, "Age": pl.Int64, "Gender": pl.Utf8})

# Concatenating DataFrames with the same schema
# Why? If schemas mismatch (e.g., one DataFrame has "Age" as Int64 and another as Utf8), Polars will throw an error.
combined_df = pl.concat([df, another_df], how="vertical")

# Print the combined DataFrame to verify correctness
print(combined_df)

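# Calculate the average age of males in the original df
# Note: male_ages was filtered from df before the concat, so Charlie (from another_df) is not included.
# Why? Ensuring "Age" is an Int64 column allows us to perform numerical calculations correctly.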
average_male_age = male_ages.mean()

# Display the filtered male ages and their average
print(male_ages)
print(average_male_age)

shape: (5, 3)
┌─────────┬─────┬────────┐
│ Name    ┆ Age ┆ Gender │
│ ---     ┆ --- ┆ ---    │
│ str     ┆ i64 ┆ str    │
╞═════════╪═════╪════════╡
│ John    ┆ 25  ┆ Male   │
│ Alice   ┆ 30  ┆ Female │
│ Bob     ┆ 28  ┆ Male   │
│ Charlie ┆ 22  ┆ Male   │
│ Diana   ┆ 26  ┆ Female │
└─────────┴─────┴────────┘
shape: (2, 1)
┌─────┐
│ Age │
│ --- │
│ i64 │
╞═════╡
│ 25  │
│ 28  │
└─────┘
shape: (1, 1)
┌──────┐
│ Age  │
│ ---  │
│ f64  │
╞══════╡
│ 26.5 │
└──────┘


#### 2. Grouping and Aggregation
To group by a column and perform aggregation, use the `group_by` and `agg` methods:

In [3]:
# Use `group_by()`; the older `groupby()` spelling is deprecated
# Note: the row order of the result is not guaranteed unless you sort it
grouped_df = combined_df.group_by("Gender").agg(
    pl.col("Age").mean().alias("Average Age")
)

print(grouped_df)

shape: (2, 2)
┌────────┬─────────────┐
│ Gender ┆ Average Age │
│ ---    ┆ ---         │
│ str    ┆ f64         │
╞════════╪═════════════╡
│ Female ┆ 28.0        │
│ Male   ┆ 25.0        │
└────────┴─────────────┘


#### 3. Selecting Columns

To select specific columns, you can use the select method:

In [4]:
# Select the "Name" and "Age" columns
df_selected = df.select(["Name", "Age"])
print(df_selected)

shape: (3, 2)
┌───────┬─────┐
│ Name  ┆ Age │
│ ---   ┆ --- │
│ str   ┆ i64 │
╞═══════╪═════╡
│ John  ┆ 25  │
│ Alice ┆ 30  │
│ Bob   ┆ 28  │
└───────┴─────┘


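#### 4. Adding New Columns

To add a column derived from existing ones, use the `with_columns` method and name the result with `alias`: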

In [5]:
df = df.with_columns((pl.col("Age") + 5).alias("Age_in_5_years"))

print(df)


shape: (3, 4)
┌───────┬─────┬────────┬────────────────┐
│ Name  ┆ Age ┆ Gender ┆ Age_in_5_years │
│ ---   ┆ --- ┆ ---    ┆ ---            │
│ str   ┆ i64 ┆ str    ┆ i64            │
╞═══════╪═════╪════════╪════════════════╡
│ John  ┆ 25  ┆ Male   ┆ 30             │
│ Alice ┆ 30  ┆ Female ┆ 35             │
│ Bob   ┆ 28  ┆ Male   ┆ 33             │
└───────┴─────┴────────┴────────────────┘


#### 5. Sorting Data

To order rows by the values in a column, use the `sort` method:

In [6]:
# Sort by "Age" in descending order
df_sorted = df.sort("Age", descending=True)

print(df_sorted)


shape: (3, 4)
┌───────┬─────┬────────┬────────────────┐
│ Name  ┆ Age ┆ Gender ┆ Age_in_5_years │
│ ---   ┆ --- ┆ ---    ┆ ---            │
│ str   ┆ i64 ┆ str    ┆ i64            │
╞═══════╪═════╪════════╪════════════════╡
│ Alice ┆ 30  ┆ Female ┆ 35             │
│ Bob   ┆ 28  ┆ Male   ┆ 33             │
│ John  ┆ 25  ┆ Male   ┆ 30             │
└───────┴─────┴────────┴────────────────┘


## Advanced Features: Parallel Processing and Lazy Evaluation
Polars runs many operations in parallel across CPU cores by default. It also supports lazy evaluation, which lets its query optimizer rework a query plan before any data is processed.

### Lazy Evaluation
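Lazy evaluation lets you build a query plan without executing it immediately. Because Polars sees the whole query before running it, it can optimize the plan, for example by applying filters early and reading only the columns that are needed. On large datasets this can improve performance significantly.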

In [7]:
# Lazy Evaluation
lazy_df = combined_df.lazy()

# Lazy filtering and aggregation
lazy_male_ages = lazy_df.filter(pl.col("Gender") == "Male").select("Age")
lazy_average_male_age = lazy_male_ages.mean()

# Collect the results (execute the lazy computation)
result = lazy_average_male_age.collect()
print(result)

shape: (1, 1)
┌──────┐
│ Age  │
│ ---  │
│ f64  │
╞══════╡
│ 25.0 │
└──────┘


## Integration with Other Libraries
### Converting to Pandas

In [8]:
# Convert Polars DataFrame to Pandas DataFrame (requires pandas and pyarrow to be installed)
pandas_df = combined_df.to_pandas()
print(pandas_df)

      Name  Age  Gender
0     John   25    Male
1    Alice   30  Female
2      Bob   28    Male
3  Charlie   22    Male
4    Diana   26  Female


### Converting from Pandas


In [9]:
import pandas as pd

# Create a sample Pandas DataFrame
pandas_data = pd.DataFrame({
    "Name": ["Eve", "Frank"],
    "Age": [27, 35],
    "Gender": ["Female", "Male"]
})

# Convert Pandas DataFrame to Polars DataFrame
polars_df_from_pandas = pl.from_pandas(pandas_data)
print(polars_df_from_pandas)




shape: (2, 3)
┌───────┬─────┬────────┐
│ Name  ┆ Age ┆ Gender │
│ ---   ┆ --- ┆ ---    │
│ str   ┆ i64 ┆ str    │
╞═══════╪═════╪════════╡
│ Eve   ┆ 27  ┆ Female │
│ Frank ┆ 35  ┆ Male   │
└───────┴─────┴────────┘


## Advantages of Polars

- Performance: Polars is known for its speed. It is designed to handle large datasets quickly and efficiently, and it often outperforms other Python data manipulation libraries. Polars uses vectorized operations and multi-threading to speed up data processing and calculations.
- Expressive Syntax: Polars has a succinct, expressive syntax that makes complex data transformations and queries simple to write. Its chainable, user-friendly API lets data scientists define their data manipulation steps in a clear and unambiguous way.
- Larger-than-Memory Data: Polars runs on a single machine and does not include built-in distributed computing across nodes. For datasets that do not fit in RAM, its lazy API offers a streaming mode that processes the data in batches, which makes it usable for many big data analytics tasks without a cluster.
- Memory Efficiency: Polars stores data in a columnar format, which lowers memory overhead. This format optimizes memory use and enables faster calculations, because an operation only needs to touch the columns it actually uses.
- Comprehensive Functionality: Polars offers aggregation, filtering, sorting, joining, and many other data manipulation and analysis operations. It also handles missing data, data encoding, and data typing, which makes it a complete tool for data science work.

## Disadvantages of Polars
- Learning Curve: Although Polars has a clear and expressive syntax, switching from Pandas takes some learning. Some concepts and features differ between the two libraries, so users need to adjust to new ways of thinking about and working with data.
- Community and Ecosystem: Polars has a smaller community and ecosystem than more established libraries such as Pandas. As a result, there are fewer online resources, tutorials, and third-party integrations, and community support is more limited. Nonetheless, the Polars community is growing, and the library is gaining recognition in the data science world.