## DataFrame

A DataFrame is a two-dimensional tabular data structure that organizes data in rows and columns. It is similar to a spreadsheet or a SQL table.

Pandas provides a DataFrame class that allows you to manipulate and analyze data efficiently. You can think of a DataFrame as a collection of Series objects, where each column represents a Series. It provides a wide range of functions and methods for data manipulation, filtering, grouping, and more.

Here's an example to help you understand:

```python
import pandas as pd

# Create a dictionary with sample data
data = {
    'Name': ['John', 'Emma', 'Alex', 'Olivia'],
    'Age': [25, 28, 30, 27],
    'City': ['New York', 'Paris', 'London', 'Sydney']
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(data)

# Display the DataFrame
print(df)
```

Output:
```
     Name  Age      City
0    John   25  New York
1    Emma   28     Paris
2    Alex   30    London
3  Olivia   27    Sydney
```

In this example, we created a DataFrame called `df`. It consists of three columns: `'Name'`, `'Age'`, and `'City'`. Each column is represented by a Series object, and together they form the DataFrame.

You can access columns, perform operations, and apply various transformations on the DataFrame. For instance, `df['ColumnName']` returns a single column as a Series, and `df['Age'] + 1` returns a new Series with every age increased by 1. The original DataFrame is unchanged unless you assign the result back to a column.
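
As a small sketch of the column operations just described, the snippet below rebuilds the sample `df` and checks that arithmetic on a column produces a new Series rather than modifying the DataFrame. The name `ages_next_year` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Alex', 'Olivia'],
    'Age': [25, 28, 30, 27],
    'City': ['New York', 'Paris', 'London', 'Sydney'],
})

# Arithmetic is element-wise and returns a new Series; df is not modified
ages_next_year = df['Age'] + 1
print(ages_next_year.tolist())   # [26, 29, 31, 28]
print(df['Age'].tolist())        # [25, 28, 30, 27]

# Assign the result back to store it in the DataFrame
df['Age'] = df['Age'] + 1
print(df['Age'].tolist())        # [26, 29, 31, 28]
```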

The DataFrame is a powerful data structure that provides a convenient way to work with structured data in Python, making it easier to perform data analysis and manipulation tasks.

In [1]:
import pandas as pd

data = {
    'Name': ['John', 'Emma', 'Alex', 'Olivia'],
    'Age': [25, 28, 30, 27],
    'City': ['New York', 'Paris', 'London', 'Sydney']
}

df = pd.DataFrame(data)

df

|   | Name | Age | City |
|---|------|-----|------|
| 0 | John | 25 | New York |
| 1 | Emma | 28 | Paris |
| 2 | Alex | 30 | London |
| 3 | Olivia | 27 | Sydney |


In [2]:
df['Name']

0      John
1      Emma
2      Alex
3    Olivia
Name: Name, dtype: object

In [3]:
type(df['Name'])

pandas.core.series.Series
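
To make the "collection of Series" idea concrete, here is a short sketch, reusing the same sample data, that checks every column is a Series sharing the DataFrame's row index. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Alex', 'Olivia'],
    'Age': [25, 28, 30, 27],
    'City': ['New York', 'Paris', 'London', 'Sydney'],
})

# Every column is a Series...
all_series = all(isinstance(df[name], pd.Series) for name in df.columns)
# ...and every column carries the same row index as the DataFrame
shares_index = all(df[name].index.equals(df.index) for name in df.columns)

print(all_series, shares_index)   # True True
print(df.shape)                   # (4, 3): 4 rows, 3 columns
print(list(df.columns))           # ['Name', 'Age', 'City']
```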

### Creating DataFrames

There are several ways to create a DataFrame, depending on where the data comes from. Here are some common methods:

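1. From a List of Lists or NumPy Arrays:
```python
import pandas as pd
import numpy as np

# Each inner list becomes one row
data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
df = pd.DataFrame(data)

# A 2D NumPy array is accepted the same way
df_from_array = pd.DataFrame(np.array(data))
```
In this example, we create a DataFrame from a list of lists and from the equivalent NumPy array. Each inner list (or array row) becomes a row of data, and because no labels are given, both the rows and the columns are labeled with integers starting at 0.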

2. From a Dictionary:
```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
```
Here, we create a DataFrame from a dictionary. The keys of the dictionary represent the column names ('Name', 'Age', 'City'), and the values represent the data for each column.

3. From a Dictionary of Series:
```python
import pandas as pd

data = {'Name': pd.Series(['Alice', 'Bob', 'Charlie']),
        'Age': pd.Series([25, 30, 35]),
        'City': pd.Series(['New York', 'London', 'Paris'])}

df = pd.DataFrame(data)
```
In this example, we create a DataFrame by passing a dictionary of Series to the `pd.DataFrame()` constructor. Each Series supplies the data for one column, and the Series are aligned on their index labels.

4. From a CSV or Excel File (covered in more detail later):
```python
import pandas as pd

df = pd.read_csv('data.csv')  # Read from a CSV file
df = pd.read_excel('data.xlsx')  # Read from an Excel file
```
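These examples create a DataFrame by reading data from a CSV or Excel file with `pd.read_csv()` and `pd.read_excel()` respectively. Note that the second assignment overwrites `df`; in practice you would use only one of the two. Reading `.xlsx` files also requires an Excel engine such as `openpyxl` to be installed.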

5. From a SQL Database:
```python
import pandas as pd
import sqlite3

conn = sqlite3.connect('database.db')
query = 'SELECT * FROM table_name'
df = pd.read_sql(query, conn)
```
Here, we establish a connection to a SQL database and execute a SQL query to fetch data into a DataFrame using the `pd.read_sql()` function.
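
The snippet above assumes a `database.db` file that already contains `table_name`. As a self-contained sketch, the version below uses an in-memory SQLite database and a hypothetical `people` table (invented for this illustration) so it can run without any file on disk.

```python
import sqlite3
import pandas as pd

# In-memory database; the 'people' table is made up for this example
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE people (name TEXT, age INTEGER)')
conn.executemany(
    'INSERT INTO people (name, age) VALUES (?, ?)',
    [('Alice', 25), ('Bob', 30), ('Charlie', 35)],
)
conn.commit()

# Run the query and load the result set into a DataFrame
people_df = pd.read_sql('SELECT name, age FROM people ORDER BY age', conn)
conn.close()

print(people_df.shape)              # (3, 2)
print(people_df['name'].tolist())   # ['Alice', 'Bob', 'Charlie']
```

The column names of the resulting DataFrame come from the query's result set, here `name` and `age`.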

These examples showcase different ways to create a DataFrame in pandas, enabling you to choose the most suitable method based on your data source and requirements.
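
Method 1 also covers NumPy arrays. As a brief sketch, a 2D array is passed the same way as a list of lists, and the optional `columns` and `index` arguments replace the default integer labels. The labels used here are arbitrary.

```python
import numpy as np
import pandas as pd

array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# columns= names the columns, index= names the rows
grid_df = pd.DataFrame(array, columns=['a', 'b', 'c'], index=['x', 'y', 'z'])

print(list(grid_df.columns))   # ['a', 'b', 'c']
print(list(grid_df.index))     # ['x', 'y', 'z']
print(grid_df.loc['y', 'b'])   # 5
```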

In [1]:
import pandas as pd
import numpy as np

data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
df = pd.DataFrame(data)

df

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 |
| 2 | 7 | 8 | 9 |


In [5]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'London', 'Paris']
}

df = pd.DataFrame(data)
df

|   | Name | Age | City |
|---|------|-----|------|
| 0 | Alice | 25 | New York |
| 1 | Bob | 30 | London |
| 2 | Charlie | 35 | Paris |


In [16]:
import pandas as pd

names = pd.Series(['Alice', 'Bob', 'Charlie'])
ages = pd.Series([25, 30, 35])
cities = pd.Series(['New York', 'London', 'Paris'])

data = {
    'Name': names,
    'Age': ages,
    'City': cities
}

df = pd.DataFrame(data)
df

|   | Name | Age | City |
|---|------|-----|------|
| 0 | Alice | 25 | New York |
| 1 | Bob | 30 | London |
| 2 | Charlie | 35 | Paris |
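
The CSV method listed earlier needs a `data.csv` file on disk. To try `pd.read_csv()` without one, the sketch below reads from an in-memory text buffer instead. The CSV content is invented for illustration.

```python
import io
import pandas as pd

csv_text = """Name,Age,City
Alice,25,New York
Bob,30,London
Charlie,35,Paris
"""

# read_csv accepts a file path or any file-like object
csv_df = pd.read_csv(io.StringIO(csv_text))

print(list(csv_df.columns))   # ['Name', 'Age', 'City']
print(csv_df['Age'].sum())    # 90
```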


### Selecting data from DataFrames

Consider the following DataFrame:

```python
import pandas as pd

data = {
    'Name': ['John', 'Alice', 'Bob', 'Emily', 'David'],
    'Age': [25, 28, 22, 30, 27],
    'City': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney'],
    'Salary': [50000, 60000, 45000, 70000, 55000]
}

df = pd.DataFrame(data)
```

1. **Selecting Columns:**

    - To select a single column, you can use the indexing operator `[]` with the column name as a string. For example:

        ```python
        name_column = df['Name']
        print(name_column)
        ```

        Output:
        ```
        0     John
        1    Alice
        2      Bob
        3    Emily
        4    David
        Name: Name, dtype: object
        ```

    - To select multiple columns, pass a list of column names within the indexing operator. For example:
    
        ```python
        subset = df[['Name', 'Age']]
        print(subset)
        ```
    
        Output:
        ```
            Name  Age
        0   John   25
        1  Alice   28
        2    Bob   22
        3  Emily   30
        4  David   27
        ```

2. **Selecting Rows:**

    - To select rows based on their index label, you can use the `loc` indexer. For example, to select the row labeled `3`:
        ```python
        row = df.loc[3]
        print(row)
        ```

        Output:
        ```
        Name      Emily
        Age          30
        City      Tokyo
        Salary    70000
        Name: 3, dtype: object
        ```

    - To select rows based on their position, you can use the `iloc` indexer. For example, to select the third row:
    
        ```python
        row = df.iloc[2]
        print(row)
        ```
    
        Output:
        ```
        Name        Bob
        Age          22
        City      Paris
        Salary    45000
        Name: 2, dtype: object
        ```

    - To select rows based on a specific condition, you can use boolean indexing. For example, to select rows where the age is greater than 25:
    
        ```python
        subset = df[df['Age'] > 25]
        print(subset)
        ```
    
        Output:
        ```
            Name  Age    City  Salary
        1  Alice   28  London   60000
        3  Emily   30   Tokyo   70000
        4  David   27  Sydney   55000
        ```

3. **Selecting Subsets of Data:**

    - To select both rows and columns simultaneously, you can use the `loc` indexer. For example, to select the name and city columns for rows where the age is less than 30:
    
        ```python
        subset = df.loc[df['Age'] < 30, ['Name', 'City']]
        print(subset)
        ```
    
        Output:
        ```
            Name      City
        0   John  New York
        1  Alice    London
        2    Bob     Paris
        4  David    Sydney
        ```

4. **Other Selection Techniques:**

    - You can use methods like `head(n)` and `tail(n)` to select the first `n` rows or the last `n` rows, respectively. If `n` is omitted, both return 5 rows. For example:

        ```python
        first_three_rows = df.head(3)
        print(first_three_rows)
        ```
        
        Output:
        ```
            Name  Age      City  Salary
        0   John   25  New York   50000
        1  Alice   28    London   60000
        2    Bob   22     Paris   45000
        ```
    
    - You can also use the `query()` method to select rows based on a specific condition in a more expressive and concise way. For example, to select rows where the salary is greater than 50000:
    
        ```python
        subset = df.query('Salary > 50000')
        print(subset)
        ```
        
        Output:
        ```
            Name  Age    City  Salary
        1  Alice   28  London   60000
        3  Emily   30   Tokyo   70000
        4  David   27  Sydney   55000
        ```

These examples demonstrate various ways to select data in a Pandas DataFrame. You can adapt these techniques based on your specific data analysis requirements.
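
Boolean conditions can also be combined. The sketch below, using the same `df`, joins two conditions with `&` (and) and `|` (or). Each condition needs its own parentheses because these operators bind more tightly than comparisons. The equivalent `query()` string is shown for comparison; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob', 'Emily', 'David'],
    'Age': [25, 28, 22, 30, 27],
    'City': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney'],
    'Salary': [50000, 60000, 45000, 70000, 55000],
})

# Rows where Age > 25 AND Salary < 70000
both = df[(df['Age'] > 25) & (df['Salary'] < 70000)]
print(both['Name'].tolist())      # ['Alice', 'David']

# Rows where City is Paris OR Age >= 30
either = df[(df['City'] == 'Paris') | (df['Age'] >= 30)]
print(either['Name'].tolist())    # ['Bob', 'Emily']

# The first filter written with query()
same = df.query('Age > 25 and Salary < 70000')
print(same.equals(both))          # True
```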

In [18]:
import pandas as pd

data = {
    'Name': ['John', 'Alice', 'Bob', 'Emily', 'David'],
    'Age': [25, 28, 22, 30, 27],
    'City': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney'],
    'Salary': [50000, 60000, 45000, 70000, 55000]
}

df = pd.DataFrame(data)

In [22]:
df['Age']

0    25
1    28
2    22
3    30
4    27
Name: Age, dtype: int64

In [23]:
df[['Name', 'Age']]

|   | Name | Age |
|---|------|-----|
| 0 | John | 25 |
| 1 | Alice | 28 |
| 2 | Bob | 22 |
| 3 | Emily | 30 |
| 4 | David | 27 |


In [24]:
df.loc[3]

Name      Emily
Age          30
City      Tokyo
Salary    70000
Name: 3, dtype: object

In [26]:
df.iloc[2]

Name        Bob
Age          22
City      Paris
Salary    45000
Name: 2, dtype: object

> **You can also pass both a row and a column to `loc` (labels) and `iloc` (integer positions) to select a single value:**

In [28]:
df.loc[0, 'Name']

'John'

In [30]:
df.iloc[0, 1]

25
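
The difference between `loc` (labels) and `iloc` (positions) is easy to miss while the index is the default 0, 1, 2, and so on. The sketch below filters a reduced version of the same data so that labels and positions no longer line up. It also shows that `loc` slices include the end label while `iloc` slices exclude the end position. The name `older` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob', 'Emily', 'David'],
    'Age': [25, 28, 22, 30, 27],
})

# Filtering keeps the original index labels
older = df[df['Age'] > 25]
print(older.index.tolist())       # [1, 3, 4]

print(older.loc[3, 'Name'])       # Emily (row labeled 3)
print(older.iloc[2, 0])           # David (third row, first column, by position)

# Slices: loc includes the end label, iloc excludes the end position
print(df.loc[1:3, 'Name'].tolist())   # ['Alice', 'Bob', 'Emily']
print(df.iloc[1:3, 0].tolist())       # ['Alice', 'Bob']
```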

### Mutating DataFrames

In Pandas, there are several ways to mutate or modify a DataFrame to add, update, or delete data. Here are some common methods:

1. Adding Columns:
   - To add a new column to a DataFrame, you can simply assign a list, a NumPy array, or a Series to a new column name. A list or array must have one value per row. For example:

     ```python
     import pandas as pd

     data = {
         'Name': ['John', 'Alice', 'Bob'],
         'Age': [25, 28, 22]
     }

     df = pd.DataFrame(data)

     # Adding a new column called 'City'
     df['City'] = ['New York', 'London', 'Paris']

     print(df)
     ```

     Output:
     ```
         Name  Age      City
     0   John   25  New York
     1  Alice   28    London
     2    Bob   22     Paris
     ```

2. Updating Values:
   - To update values in a DataFrame, you can use various methods like indexing, boolean conditions, or the `loc` indexer. For example, to update the 'Age' column based on a condition:

     ```python
     # Updating the 'Age' column where 'Name' is 'John'
     df.loc[df['Name'] == 'John', 'Age'] = 26

     print(df)
     ```

     Output:
     ```
         Name  Age      City
     0   John   26  New York
     1  Alice   28    London
     2    Bob   22     Paris
     ```

3. Deleting Columns:
   - To delete a column from a DataFrame, you can use the `drop()` method and specify the column name and axis. The default axis is 0 (rows), so you need to set `axis=1` to delete a column. For example:

     ```python
     # Deleting the 'City' column
     df = df.drop('City', axis=1)

     print(df)
     ```

     Output:
     ```
         Name  Age
     0   John   26
     1  Alice   28
     2    Bob   22
     ```

4. Deleting Rows:
   - To delete rows based on a condition, you can use boolean indexing to select the rows you want to keep, effectively filtering out the rows you want to delete. For example:

     ```python
     # Deleting rows where 'Name' is 'Alice'
     df = df[df['Name'] != 'Alice']

     print(df)
     ```

     Output:
     ```
         Name  Age
     0   John   26
     2    Bob   22
     ```

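These are some common methods for mutating a DataFrame in Pandas. Assigning to a column or through `loc` changes the DataFrame in place, whereas `drop()` and boolean filtering return a new DataFrame, so assign the result back to a variable if you want to keep those changes.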
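
Which operations change `df` directly and which return a new object can be checked with a short sketch. It uses the same small DataFrame as the examples above; `without_city` and `not_alice` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Alice', 'Bob'], 'Age': [25, 28, 22]})

# Assignment through [] or loc modifies df itself
df['City'] = ['New York', 'London', 'Paris']
df.loc[df['Name'] == 'John', 'Age'] = 26
print(list(df.columns))             # ['Name', 'Age', 'City']
print(df['Age'].tolist())           # [26, 28, 22]

# drop() and boolean filtering return new DataFrames; df is untouched
without_city = df.drop('City', axis=1)
not_alice = df[df['Name'] != 'Alice']
print(list(without_city.columns))   # ['Name', 'Age']
print(len(not_alice), len(df))      # 2 3
print('City' in df.columns)         # True
```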

In [33]:
import pandas as pd

data = {
    'Name': ['John', 'Alice', 'Bob', 'Emily', 'David'],
    'Age': [25, 28, 22, 30, 27],
    'City': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney'],
    'Salary': [50000, 60000, 45000, 70000, 55000]
}

df = pd.DataFrame(data)
df

|   | Name | Age | City | Salary |
|---|------|-----|------|--------|
| 0 | John | 25 | New York | 50000 |
| 1 | Alice | 28 | London | 60000 |
| 2 | Bob | 22 | Paris | 45000 |
| 3 | Emily | 30 | Tokyo | 70000 |
| 4 | David | 27 | Sydney | 55000 |


In [34]:
df.loc[3, 'Age'] = 33
df

|   | Name | Age | City | Salary |
|---|------|-----|------|--------|
| 0 | John | 25 | New York | 50000 |
| 1 | Alice | 28 | London | 60000 |
| 2 | Bob | 22 | Paris | 45000 |
| 3 | Emily | 33 | Tokyo | 70000 |
| 4 | David | 27 | Sydney | 55000 |


In [35]:
df['Children'] = [0, 1, 0, 2, 1]
df

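|   | Name | Age | City | Salary | Children |
|---|------|-----|------|--------|----------|
| 0 | John | 25 | New York | 50000 | 0 |
| 1 | Alice | 28 | London | 60000 | 1 |
| 2 | Bob | 22 | Paris | 45000 | 0 |
| 3 | Emily | 33 | Tokyo | 70000 | 2 |
| 4 | David | 27 | Sydney | 55000 | 1 |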


In [36]:
df.drop(0)

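|   | Name | Age | City | Salary | Children |
|---|------|-----|------|--------|----------|
| 1 | Alice | 28 | London | 60000 | 1 |
| 2 | Bob | 22 | Paris | 45000 | 0 |
| 3 | Emily | 33 | Tokyo | 70000 | 2 |
| 4 | David | 27 | Sydney | 55000 | 1 |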


In [37]:
df.drop('Name', axis=1)

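|   | Age | City | Salary | Children |
|---|-----|------|--------|----------|
| 0 | 25 | New York | 50000 | 0 |
| 1 | 28 | London | 60000 | 1 |
| 2 | 22 | Paris | 45000 | 0 |
| 3 | 33 | Tokyo | 70000 | 2 |
| 4 | 27 | Sydney | 55000 | 1 |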


> **Note that `drop` returns a new DataFrame with the specified column or row dropped. If you want to mutate the DataFrame itself, pass `inplace=True` as an argument.**

In [39]:
df.drop('Children', axis=1, inplace=True)
df

|   | Name | Age | City | Salary |
|---|------|-----|------|--------|
| 0 | John | 25 | New York | 50000 |
| 1 | Alice | 28 | London | 60000 |
| 2 | Bob | 22 | Paris | 45000 |
| 3 | Emily | 33 | Tokyo | 70000 |
| 4 | David | 27 | Sydney | 55000 |
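
As an alternative to `axis` and `inplace=True`, `drop()` also accepts `columns=` and `index=` keywords, and reassigning the result is a common way to keep the change. A brief sketch on a reduced version of the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob'],
    'Age': [25, 28, 22],
    'Children': [0, 1, 0],
})

# columns= drops by column name, index= drops by row label
df = df.drop(columns=['Children'])
df = df.drop(index=[0])

print(list(df.columns))      # ['Name', 'Age']
print(df.index.tolist())     # [1, 2]

# reset_index(drop=True) renumbers the rows from 0 after deletions
df = df.reset_index(drop=True)
print(df.index.tolist())     # [0, 1]
```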
