# **🐼 Pandas in Data Science**

* **Pandas** is an open-source Python library used for data manipulation and analysis. It provides fast, flexible, and expressive data structures like **Series** and **DataFrame**.

---

## 📦 Importing Pandas

* Import the Pandas library, conventionally under the alias `pd`.

```python
import pandas as pd
```

---

## 🧱 Data Structures

* Pandas primarily provides two data structures: `Series` (1D) and `DataFrame` (2D).

### 📊 Series (1D labeled array)

```python
import numpy as np  # required for np.nan, which marks a missing value
s = pd.Series([1, 3, 5, np.nan, 6, 8])
```

### 🧾 DataFrame (2D labeled data structure)

* A `DataFrame` is a 2D tabular data structure with labeled axes (rows and columns).

```python
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35]
})
```

#### 📌 DataFrame Syntax:
```python
pd.DataFrame(
    data=None,      # Data to populate the DataFrame (lists, dicts, arrays, Series, etc.)
    index=None,     # Row labels; if not provided, a default integer index is used
    columns=None,   # Column labels; inferred from data if not specified
    dtype=None,     # Data type to force; overrides data types inferred from the input
    copy=None       # Whether to copy the input; with None, the behavior depends on the input type
)
```

---

## 📂 Reading & Writing Data

* Read and write various file types like CSV, Excel, and JSON using built-in functions.

    | Operation         | Function         | Example                                |
    |------------------|------------------|----------------------------------------|
    | 📥 Read CSV      | `read_csv()`     | `pd.read_csv("data.csv")`              |
    | 📤 Write CSV     | `to_csv()`       | `df.to_csv("output.csv", index=False)` |
    | 🧮 Read Excel     | `read_excel()`   | `pd.read_excel("data.xlsx")`           |
    | 💾 Save to Excel | `to_excel()`     | `df.to_excel("output.xlsx")`           |
    | 🧠 Read JSON     | `read_json()`    | `pd.read_json("data.json")`            |
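
The write and read functions above can be tried without touching the disk. The following is a minimal sketch that round-trips a small, made-up DataFrame through CSV, using an in-memory `io.StringIO` buffer as a stand-in for a real file path.

```python
import io

import pandas as pd

# Hypothetical sample data; the buffer stands in for a file such as "data.csv".
df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False keeps the row labels out of the output
buffer.seek(0)                  # rewind so the buffer can be read from the start

restored = pd.read_csv(buffer)
print(restored)
```

Passing `index=False` when writing avoids an extra unnamed column appearing when the data is read back.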

---

## 🔍 Data Exploration

* Understand and summarize the dataset quickly using these methods.

    | Function          | Description                 |
    |------------------|-----------------------------|
    | `df.head()`       | Shows first 5 rows (by default) |
    | `df.tail()`       | Shows last 5 rows (by default)  |
    | `df.info()`       | Index, column dtypes, non-null counts, memory usage |
    | `df.describe()`   | Summary statistics of numeric columns |
    | `df.shape`        | Dimensions (rows, columns)  |
    | `df.columns`      | Column labels               |
    | `df.dtypes`       | Data types of columns       |
    | `df["col"].value_counts()` | Count of each unique value in a column |
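
As a quick illustration of the exploration methods in the table, here is a small sketch on an invented dataset. The column names and values are assumptions chosen only for the example.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dana"],
    "City": ["Pune", "Delhi", "Pune", "Pune"],
    "Age": [25, 30, 35, 40],
})

print(df.head(2))   # first two rows instead of the default five
print(df.shape)     # (4, 3)
print(df.dtypes)    # one dtype per column

# value_counts() on a single column counts how often each value occurs.
city_counts = df["City"].value_counts()
print(city_counts)
```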

---

## 🔧 Data Manipulation

* Transform data using these flexible and powerful methods.

    | Function           | Description               |
    |-------------------|---------------------------|
    | `df["col"]`        | Access a column           |
    | `df[["col1", "col2"]]` | Access multiple columns |
    | `df.loc[]`         | Access by label           |
    | `df.iloc[]`        | Access by integer position |
    | `df.assign()`      | Add new columns           |
    | `df.drop()`        | Remove rows/columns       |
    | `df.rename()`      | Rename columns            |
    | `df.sort_values()` | Sort rows                 |
    | `df.set_index()`   | Set index                 |
    | `df.reset_index()` | Reset index               |
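
The difference between `loc` and `iloc`, and the way the other methods chain together, may be easier to see in code. This is a minimal sketch on made-up data; the index labels `a`, `b`, `c` and the `AgeNextYear` column are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data with string row labels so loc and iloc differ visibly.
df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]},
    index=["a", "b", "c"],
)

by_label = df.loc["b", "Age"]   # row labeled "b", column labeled "Age"
by_position = df.iloc[0, 0]     # first row, first column

# Each method returns a new DataFrame, so the calls can be chained.
result = (
    df.assign(AgeNextYear=df["Age"] + 1)       # add a derived column
      .rename(columns={"Name": "FirstName"})   # rename a column
      .drop(index="a")                         # remove the row labeled "a"
      .sort_values("Age", ascending=False)     # oldest first
)
print(result)
```

None of these calls modify `df` in place; the original DataFrame keeps its three rows and two columns.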

---

## 🧼 Data Cleaning

* Clean and prepare data by handling missing values and duplicates.

    | Function              | Description             |
    |----------------------|-------------------------|
    | `df.isnull()`         | Detect missing values   |
    | `df.dropna()`         | Drop missing values     |
    | `df.fillna()`         | Fill missing values     |
    | `df.duplicated()`     | Check for duplicates    |
    | `df.drop_duplicates()`| Drop duplicates         |
    | `df.replace()`        | Replace values          |
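
The duplicate-handling and replacement functions from the table can be sketched as follows. The data is invented, and treating the string `"N/A"` as a placeholder for an unknown city is an assumption made for this example.

```python
import pandas as pd

# Hypothetical data containing one repeated row and one placeholder value.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Bob", "Charlie"],
    "City": ["NYC", "LA", "LA", "N/A"],
})

dup_mask = df.duplicated()       # True for rows that repeat an earlier row
deduped = df.drop_duplicates()   # keeps the first occurrence of each row
cleaned = deduped.replace("N/A", "Unknown")

print(dup_mask.tolist())
print(cleaned)
```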

---

## 🔄 Filtering & Selection

* Select rows that meet certain conditions using boolean indexing.

```python
df[df["Age"] > 30]  # Rows where Age is greater than 30
df[(df["Age"] > 25) & (df["Name"] != "Bob")]  # Combine conditions with & and parentheses
```

---

## 📐 Grouping & Aggregation

* Group data and compute summary statistics such as mean, sum, etc.

```python
df.groupby("Department").mean(numeric_only=True)  # average only the numeric columns
df.groupby("Department")["Salary"].sum()
```
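
The snippet above assumes a DataFrame with `Department` and `Salary` columns, which the earlier examples do not define. Here is a self-contained sketch with invented values that also shows `agg()` computing several statistics at once.

```python
import pandas as pd

# Hypothetical employee data for illustration.
df = pd.DataFrame({
    "Department": ["HR", "IT", "IT", "HR"],
    "Employee": ["Asha", "Ben", "Chen", "Dev"],
    "Salary": [40000, 60000, 80000, 50000],
})

# Select the column first, then aggregate within each group.
mean_by_dept = df.groupby("Department")["Salary"].mean()

# agg() accepts a list of aggregation names and returns one column per name.
summary = df.groupby("Department")["Salary"].agg(["sum", "mean", "count"])
print(summary)
```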

---

## 🔗 Merging & Joining

* Combine multiple datasets using merge, join, or concatenate.

    | Function      | Description            |
    |---------------|------------------------|
    | `pd.concat()` | Concatenate along axis |
    | `pd.merge()`  | SQL-style joins        |
    | `df.join()`   | Join on index          |
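
To make the three functions concrete, here is a short sketch using two made-up tables that share an `emp_id` column. The table and column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical tables; emp_id 3 has no salary and emp_id 4 has no employee row.
employees = pd.DataFrame({"emp_id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]})
salaries = pd.DataFrame({"emp_id": [1, 2, 4], "salary": [5000, 6000, 7000]})

# concat stacks frames along an axis (rows by default).
stacked = pd.concat([employees, employees], ignore_index=True)

# merge matches rows on a key column, like a SQL join.
inner = pd.merge(employees, salaries, on="emp_id", how="inner")  # keys in both tables
left = pd.merge(employees, salaries, on="emp_id", how="left")    # all employees

# join aligns on the index, so the key is moved into the index first.
joined = employees.set_index("emp_id").join(salaries.set_index("emp_id"))
print(joined)
```

With `how="left"`, employee 3 is kept and its `salary` is `NaN` because no matching row exists in `salaries`.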

---

## 📊 Visualization (with matplotlib/seaborn)

* Basic data visualizations to explore data trends. The Pandas `.plot()` methods draw with matplotlib, so matplotlib must be installed.

```python
df["Age"].plot(kind="hist")
df.plot(x="Name", y="Age", kind="bar")
```

---

## 🧠 Tips & Tricks

* Handy functions and best practices for efficient Pandas use.

- Use `df.sample(n=5)` for random samples 🔀
- Use `df.memory_usage()` to inspect memory usage 📏
- Use `df.apply()` for custom functions 🔄
- Chain methods with `df.pipe()` for clean code 🧼
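
The following sketch shows how `apply()` and `pipe()` might fit together. The helper `add_age_group` and its `cutoff` parameter are hypothetical names introduced only for this example.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["alice", "bob"], "Age": [25, 30]})


def add_age_group(frame, cutoff):
    """Hypothetical helper: label each row relative to an age cutoff."""
    labels = frame["Age"].apply(lambda age: "senior" if age >= cutoff else "junior")
    return frame.assign(AgeGroup=labels)


# pipe() passes the DataFrame as the first argument of the function,
# so custom steps sit in the same chain as built-in methods.
result = (
    df.assign(Name=df["Name"].str.title())
      .pipe(add_age_group, cutoff=30)
)
print(result)

# deep=True also counts the memory held by string values.
usage = df.memory_usage(deep=True)
print(usage)
```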

---

## 🏁 Conclusion

* Pandas is essential for data science tasks like cleaning, exploring, analyzing, and visualizing data. Mastering these functions gives you a solid foundation for handling real-world datasets efficiently! 💪

---

In [None]:
import pandas as pd
import numpy as np

data = {'Product': ['Apple','Banana','Cherry'],
        'Price': [200, 100, 500],
        'Quantity': [10, 20, 30]
}

df = pd.DataFrame(data)
print("DataFrame\n", df)
df['Total'] = df['Price']*df['Quantity']
print("DataFrame with Total\n", df)
print(f"Average Price: {df['Price'].mean()}")
print(df.describe())

In [None]:
d = {'one': pd.Series([1, 2, 3], index = ['a','b','c']),'two': pd.Series([1, 2, 3, 4], index = ['a','b','c','d'])}

df = pd.DataFrame(d)
print("DataFrame with different lengths\n", df)

In [None]:
data = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4, 'c': 5}]

df = pd.DataFrame(data, index = ['first', 'second'], columns = ['a', 'b'])
print("DataFrame with different columns\n", df)

df = pd.DataFrame(data, index = ['first', 'second'], columns = ['a', 'b1'])
print("DataFrame with missing columns\n", df)

In [None]:
data = {'Product': ['Apple','Banana','Cherry', 'Dates', np.nan],
        'Price': [200,np.nan, 100, 500, 200],
        'Quantity': [10, 20, np.nan, 30, 40]
}

df = pd.DataFrame(data)
print("DataFrame\n", df)
print(df.isnull())
print(df.isnull().sum())
df1 = df.dropna()
print("DataFrame after dropping NaN\n", df1)
df.fillna(0, inplace=True)
print("DataFrame after filling NaN with 0\n", df)

# **TASK 1:**
**Problem Statement:**
* Jai needs a program that collects employee details such as Employee ID, Email ID, Experience, Nationality, and Reimbursement. However, due to various reasons, some employees might not provide all the information initially. 
* Jai wants the program to handle missing data gracefully by replacing it with the string 'none' in the final output.
---
**Input format:**

* The input consists of five lines.
* Each line contains a hyphen-separated list of values corresponding to employee details.
* The order of information across the lines corresponds to the following:
  - Employee IDs (string)
  - Email IDs (string)
  - Experience values (integer)
  - Nationality information (string)
  - Reimbursement amounts (integer)
* **Note:** If any detail is missing for an employee, it appears as an empty field between the hyphen (-) separators, for example `USA--UK`.
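
A minimal sketch of how such a line can be parsed: splitting on the hyphen yields an empty string for every missing field, and each empty string can then be mapped to `'none'`. The variable names here are assumptions for illustration.

```python
line = "-3-7"                     # Experience line from the sample test case
fields = line.strip().split("-")  # a missing value becomes an empty string
print(fields)                     # ['', '3', '7']

# An empty string is falsy, so it can be swapped for the 'none' placeholder.
values = [field if field else "none" for field in fields]
print(values)                     # ['none', '3', '7']
```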
---
**Output format:**

* The output displays the data frame in tabular form with columns for 'Employee ID', 'Email ID', 'Experience', 'Nationality', and 'Reimbursement'.

* **Note:** If the Email ID, Nationality, Experience, or Reimbursement value is missing in the input, it must be replaced with 'none' in the corresponding cell of the output table. The 'Employee ID' is guaranteed to be non-empty.
---
**Code constraints:**

1. Employee IDs are non-empty strings and unique.
2. Experience and Reimbursement values can be integers or empty.
3. Email IDs and Nationalities can be strings or empty.
4. The number of values in each input line will be the same, corresponding to the number of employees.
---
**Sample test case:**

* **Input:**

  - 1234-5678-9012
  - john@example.com-mary@example.com-
  - -3-7
  - USA--UK
  - 1000--1200

* **Output:**

  |   | Employee ID |          Email ID | Experience |  Nationality | Reimbursement |
  |---|-------------|-------------------|------------|--------------|---------------|
  |0  |        1234 |  john@example.com |       none |          USA |          1000 |
  |1  |        5678 |  mary@example.com |          3 |         none |          none |
  |2  |        9012 |              none |          7 |           UK |          1200 |

In [None]:
import pandas as pd

def employee_details():
    emp_id = input().strip().split('-')
    email_id = input().strip().split('-')
    exp = input().strip().split('-')
    nation = input().strip().split('-')
    reimbur = input().strip().split('-')
    details=[]
    for i in range(len(emp_id)):
        detail = {'Employee ID': emp_id[i] if emp_id[i] else 'none',
                  'Email ID': email_id[i] if email_id[i] else 'none',
                  'Experience': exp[i] if exp[i] else 'none',
                  'Nationality': nation[i] if nation[i] else 'none',
                  'Reimbursement': reimbur[i] if reimbur[i] else 'none'}
        details.append(detail)
    return details
print(pd.DataFrame(employee_details()))
