# Pandas Functions: Overview, Syntax, and Examples

## pd.Series()

**Usage**: Creates a one-dimensional labeled array (Series) from data.
        
**Parameters**:
- `data`: Array-like, Iterable, dict, or scalar value.
- `index`: Array-like or Index (default is `RangeIndex`).
- `dtype`: Data type of the output Series.
- `name`: Name of the Series.
        

In [1]:
import pandas as pd
import numpy as np

s = pd.Series([1, 3, 5, np.nan, 6, 8])
s1 = pd.Series([1, 3, 5, np.nan, 6, 8], index=['a', 'b', 'c', 'd', 'e', 'f'], name='test')
print(s)
print("*"*20)
print(s1)

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
********************
a    1.0
b    3.0
c    5.0
d    NaN
e    6.0
f    8.0
Name: test, dtype: float64


In [2]:
# Access Series elements by index label
print(s[2], s1['b'])

5.0 3.0
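
The `data` parameter also accepts a dict or a scalar, as listed above. A minimal sketch of both cases follows; the names `prices` and `zeros` and their values are made up for illustration.

```python
import pandas as pd

# From a dict: the keys become the index labels.
prices = pd.Series({"apple": 1.5, "pear": 2.0}, name="price")

# From a scalar: the value is repeated for every label in `index`.
zeros = pd.Series(0, index=["a", "b", "c"], dtype="int64")

print(prices)
print(zeros)
```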


## pd.DataFrame()

**Usage**: Creates a two-dimensional labeled data structure (DataFrame) from various data inputs.

**Parameters**:
- `data`: Various forms like dict, array, Series, or DataFrame.
- `index`: Index or array-like to label rows.
- `columns`: Index or array-like to label columns.
- `dtype`: Data type to force.
        

In [3]:
random_data = np.random.randn(6, 4)
print(random_data)

[[-0.68870502  1.05972925  0.70966985 -1.36013954]
 [-1.82899938  0.77027667 -0.6922755   1.49090064]
 [-1.65676179  0.78328087  0.20489654  0.47279722]
 [-0.84354071  0.04436798  0.90298661  0.28607249]
 [-0.2094124   0.10464039  1.99449948 -0.25351467]
 [ 1.78896953 -0.8222358  -0.37859783 -0.63784726]]


In [4]:
dates = pd.date_range("2023-01-01", periods=6)
df = pd.DataFrame(random_data, index=dates, columns=['A', 'B', 'C', 'D'])
print(df)

                   A         B         C         D
2023-01-01 -0.688705  1.059729  0.709670 -1.360140
2023-01-02 -1.828999  0.770277 -0.692275  1.490901
2023-01-03 -1.656762  0.783281  0.204897  0.472797
2023-01-04 -0.843541  0.044368  0.902987  0.286072
2023-01-05 -0.209412  0.104640  1.994499 -0.253515
2023-01-06  1.788970 -0.822236 -0.378598 -0.637847


## pd.date_range()

**Usage**: Generates a fixed-frequency `DatetimeIndex`.

**Parameters**:
- `start`: Start date.
- `end`: End date.
- `periods`: Number of periods to generate.
- `freq`: Frequency string (e.g., 'D' for daily).
        

In [5]:
dates = pd.date_range(start="2023-01-01", end="2023-01-06")
print(dates)

DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04',
               '2023-01-05', '2023-01-06'],
              dtype='datetime64[ns]', freq='D')
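
The `periods` and `freq` parameters can replace `end`. The sketch below assumes a two-day step (`"2D"`) purely for illustration; `every_other_day` is a made-up name.

```python
import pandas as pd

# Three timestamps spaced two days apart, starting on 2023-01-01.
every_other_day = pd.date_range(start="2023-01-01", periods=3, freq="2D")
print(every_other_day)
```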


## df.head() and df.tail()

**Usage**: Returns the first or last `n` rows of the DataFrame.

**Parameters**:
- `n`: Number of rows to return (default is 5).
        

In [6]:
df = pd.DataFrame(np.random.randn(10, 4), columns=list("ABCD"))
print(df.head(3))
print(df.tail(3))

          A         B         C         D
0  1.605023 -1.252496 -1.117942 -0.264162
1 -0.673856  0.394693  0.795585  0.756705
2  1.608609  0.312549 -0.092162  0.165893
          A         B         C         D
7  0.004701  1.107013 -1.389911  2.907592
8 -0.670586  1.603524  0.523739  0.295516
9 -0.992684 -1.137949  1.652399  0.590471


## df.index

**Usage**: Returns the index (row labels) of the DataFrame.

In [7]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])
print(df)
print(df.index)

   A  B
a  1  4
b  2  5
c  3  6
Index(['a', 'b', 'c'], dtype='object')


## df.columns

**Usage**: Returns the column labels of the DataFrame.

In [8]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(df.columns)

Index(['A', 'B'], dtype='object')


## df.to_numpy()

**Usage**: Converts the DataFrame to a NumPy array.

**Parameters**:
- `dtype`: Desired data type of the array.
- `copy`: Whether to ensure a copy is made (default is False).
        

In [9]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(df)
print(df.to_numpy())

   A  B
0  1  4
1  2  5
2  3  6
[[1 4]
 [2 5]
 [3 6]]
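
The `dtype` parameter can force the element type of the result. A small sketch, using the hypothetical frames `df_nums` and `df_mixed`, also shows that columns of different types fall back to a common dtype:

```python
import pandas as pd

df_nums = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
as_float = df_nums.to_numpy(dtype="float64")  # force a float array

# Integer and string columns have no common numeric type,
# so the resulting array uses the generic `object` dtype.
df_mixed = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})
as_object = df_mixed.to_numpy()

print(as_float.dtype, as_object.dtype)
```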


## df.describe()

**Usage**: Generates descriptive statistics of the DataFrame.

**Parameters**:
- `percentiles`: List of percentiles to include.
- `include`: Data types to include.
- `exclude`: Data types to exclude.
        

In [10]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(df.describe())

         A    B
count  3.0  3.0
mean   2.0  5.0
std    1.0  1.0
min    1.0  4.0
25%    1.5  4.5
50%    2.0  5.0
75%    2.5  5.5
max    3.0  6.0
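
The `percentiles` parameter takes fractions between 0 and 1. The sketch below requests the 10th and 90th percentiles in place of the default quartiles; `df_stats` and `summary` are illustrative names.

```python
import pandas as pd

df_stats = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# Percentiles are given as fractions, not as whole numbers.
summary = df_stats.describe(percentiles=[0.1, 0.9])
print(summary)
```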


## df.T

**Usage**: Transposes the DataFrame, swapping rows and columns.

In [11]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(df.T)

   0  1  2
A  1  2  3
B  4  5  6


## df.sort_index()

**Usage**: Sorts the DataFrame by its index.

**Parameters**:
- `axis`: Axis to sort along (0 for index, 1 for columns).
- `ascending`: Sort ascending vs. descending.
        

In [12]:
df = pd.DataFrame({'A': [3, 100, 2]}, index=['c', 'a', 'b'])
print(df.sort_index())

     A
a  100
b    2
c    3
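
Both parameters can be tried on a small frame. This sketch assumes a hypothetical `df_idx` whose rows and columns are both out of order:

```python
import pandas as pd

df_idx = pd.DataFrame({"B": [3, 100, 2], "A": [7, 8, 9]}, index=["c", "a", "b"])

rows_desc = df_idx.sort_index(ascending=False)  # row labels become c, b, a
cols_sorted = df_idx.sort_index(axis=1)         # column labels become A, B

print(rows_desc)
print(cols_sorted)
```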


## df.sort_values()

**Usage**: Sorts the DataFrame by the specified column(s).

**Parameters**:
- `by`: Column label or list of labels to sort by.
- `ascending`: Sort ascending vs. descending.
- `inplace`: If True, perform operation in-place.
        

In [13]:
df = pd.DataFrame({'A': [3, 1, 2]})
print(df.sort_values(by='A'))

   A
1  1
2  2
0  3
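
`by` and `ascending` both accept lists, so each column can have its own sort direction. A minimal sketch with made-up data (`df_sales`, `team`, `score`):

```python
import pandas as pd

df_sales = pd.DataFrame({"team": ["x", "y", "x", "y"], "score": [3, 1, 2, 4]})

# Sort by team ascending, then by score descending within each team.
ranked = df_sales.sort_values(by=["team", "score"], ascending=[True, False])
print(ranked)
```

Because `inplace` defaults to `False`, `df_sales` itself keeps its original row order.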


## df.loc[]

**Usage**: Accesses a group of rows and columns by labels or a boolean array.

`df.loc[row_labels, column_labels]`

**Parameters**:
- `row_labels`: Label(s) of rows to select.
- `column_labels`: Label(s) of columns to select.
        

In [14]:
import pandas as pd

# Sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data, index=['A', 'B', 'C'])

print(df)
print("*"*30)
# Selecting a single row
row_B = df.loc['B']
print(row_B)


      Name  Age         City
A    Alice   25     New York
B      Bob   30  Los Angeles
C  Charlie   35      Chicago
******************************
Name            Bob
Age              30
City    Los Angeles
Name: B, dtype: object


In [15]:
# Selecting specific columns
subset = df.loc[:, ['Name', 'City']]
print(subset)


      Name         City
A    Alice     New York
B      Bob  Los Angeles
C  Charlie      Chicago


In [16]:
# Selecting a single cell
city_of_B = df.loc['B', 'City']
print(city_of_B)


Los Angeles


In [17]:
# Boolean mask to filter rows where Age > 28
bool_array = df['Age'] > 28
filtered_df = df.loc[bool_array]
print(filtered_df)


      Name  Age         City
B      Bob   30  Los Angeles
C  Charlie   35      Chicago
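
Two more `df.loc[]` patterns are worth a quick sketch: label slices, which include both endpoints, and a boolean mask combined with a column list. The frame is rebuilt here under the assumed name `people` so the block runs on its own.

```python
import pandas as pd

people = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie"],
        "Age": [25, 30, 35],
        "City": ["New York", "Los Angeles", "Chicago"],
    },
    index=["A", "B", "C"],
)

# Label slices include both endpoints: rows 'A' and 'B', columns 'Name' and 'Age'.
first_two = people.loc["A":"B", "Name":"Age"]

# A boolean mask for the rows and a list of labels for the columns.
older_names = people.loc[people["Age"] > 28, ["Name"]]

print(first_two)
print(older_names)
```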


## df.iloc[]

**Usage**: Accesses a group of rows and columns by integer positions.

`df.iloc[row_positions, column_positions]`


**Parameters**:
- `row_positions`: Integer position(s) of rows to select.
- `column_positions`: Integer position(s) of columns to select.
        

In [18]:

# Sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)

# Selecting the second row (index 1)
row_1 = df.iloc[1]
print(row_1)


Name            Bob
Age              30
City    Los Angeles
Name: 1, dtype: object


In [19]:
# Selecting first two rows and first two columns
subset = df.iloc[:2, :2]
print(subset)


    Name  Age
0  Alice   25
1    Bob   30


In [20]:
# Selecting the value in the second row, third column
cell_value = df.iloc[1, 2]
print(cell_value)


Los Angeles


In [21]:
# Selecting specific rows and columns
subset = df.iloc[[0, 2], [1, 2]]
print(subset)


   Age      City
0   25  New York
2   35   Chicago
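
Positions are independent of the index labels, and slices follow Python rules (the stop position is excluded). A short sketch, assuming a hypothetical `people` frame with string labels:

```python
import pandas as pd

people = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]},
    index=["x", "y", "z"],
)

last_row = people.iloc[-1]    # negative positions count from the end
first_two = people.iloc[0:2]  # positions 0 and 1; position 2 is excluded

print(last_row)
print(first_two)
```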


## Indexing with brackets []

Brackets work for column selection and boolean filtering, but they are **not recommended for row selection**; use `df.loc[]` or `df.iloc[]` for rows.
- `df['col']` selects a single column.
- `df[['col1', 'col2']]` selects multiple columns.
- `df[df['col'] > x]` filters rows by a condition on a column.
- `df.loc[]` handles label-based selection, e.g. `df.loc[1, 'col1']`.
- `df.iloc[]` handles position-based selection, e.g. `df.iloc[1, 0]`.

In [22]:
print(df['Name'])
print("*"*30)
print(df[['Name', 'Age']])
print("*"*30)
print(df[df['Age']>=30]) #df[Boolean Array]

0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object
******************************
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
******************************
      Name  Age         City
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
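
One way to see why brackets are discouraged for rows: a slice inside `[]` selects rows, while a single integer is treated as a column label. The sketch below uses a made-up `people` frame to show the inconsistency.

```python
import pandas as pd

people = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]})

sliced = people[0:2]  # a slice inside [] selects rows by position

try:
    people[0]  # a single integer is looked up as a column label
    lookup_failed = False
except KeyError:
    lookup_failed = True  # there is no column named 0

explicit = people.iloc[0:2]  # same rows as `sliced`, with the intent spelled out

print(sliced.equals(explicit), lookup_failed)
```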


## pd.concat()

**Usage**: Concatenates pandas objects along a particular axis.

**Parameters**:
- `objs`: Sequence or mapping of Series or DataFrame objects.
- `axis`: Axis to concatenate along (0 for index, 1 for columns).
- `join`: How to handle indexes on other axis ('inner' or 'outer').
- `ignore_index`: If True, do not use the index values along the concatenation axis.
        

In [23]:
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})
print(df1)
print("*"*20)
print(df2)
print("*"*20)
print(pd.concat([df1, df2], axis=1))
print("*"*20)
print(pd.concat([df1, df2], axis=0, ignore_index=True))



   A
0  1
1  2
********************
   A
0  3
1  4
********************
   A  A
0  1  3
1  2  4
********************
   A
0  1
1  2
2  3
3  4
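
The `join` parameter matters when the inputs do not share the same columns. A minimal sketch with two hypothetical frames, `left` and `right`, that overlap only on column `A`:

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
right = pd.DataFrame({"A": [5, 6], "C": [7, 8]})

# 'outer' (the default) keeps every column and fills the gaps with NaN.
outer = pd.concat([left, right], ignore_index=True)

# 'inner' keeps only the columns present in every input.
inner = pd.concat([left, right], join="inner", ignore_index=True)

print(outer)
print(inner)
```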


## df.groupby()

**Usage**: Groups the rows of a DataFrame using a mapper or one or more columns.

**Parameters**:
- `by`: Mapping, function, label, or list of labels.
- `axis`: Split along rows (0) or columns (1). Deprecated since pandas 2.1; it can usually be omitted.
- `level`: Level of MultiIndex.
- `as_index`: For aggregated output, return object with group labels as the index.
- `sort`: Sort group keys.



In [24]:
import pandas as pd

# Sample DataFrame
data = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A'],
    'Values': [10, 20, 30, 40, 50]
})

print(data)
print("*"*30)
# Grouping by 'Category' and computing the sum
grouped = data.groupby('Category').sum()
print(grouped)


  Category  Values
0        A      10
1        B      20
2        A      30
3        B      40
4        A      50
******************************
          Values
Category        
A             90
B             60


### Understanding data.groupby('Category')
```python
grouped = data.groupby('Category')
```
`grouped` is a GroupBy object. It does not produce a modified DataFrame by itself; you either iterate over it or apply an aggregation to get results.


In [25]:
grouped = data.groupby('Category')
for key, group in grouped:
    print(f"Group: {key}")
    print(group)
    print("*"*20)



Group: A
  Category  Values
0        A      10
2        A      30
4        A      50
********************
Group: B
  Category  Values
1        B      20
3        B      40
********************


### Aggregating Data (sum(), mean(), count())
Once grouped, you can apply aggregation functions like sum, mean, count, etc.

```python
grouped_sum = data.groupby('Category').sum()
```

In [26]:
data.groupby('Category').mean()


          Values
Category        
A           30.0
B           30.0


### Applying Multiple Aggregations (agg())

Instead of a single operation, you can apply several functions at once by passing their names as a list of strings. Although they are passed as strings, each one must be a predefined aggregation name that pandas recognizes.

For example, passing `['sum', 'pdt']` raises an error because `'pdt'` is not a valid aggregation name.

In [27]:
grouped_agg = data.groupby('Category').agg(['sum','mean', 'count'])
print(grouped_agg)


         Values            
            sum  mean count
Category                   
A            90  30.0     3
B            60  30.0     2
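
`agg()` also accepts keyword arguments, which name the output columns, and callables, which cover cases without a predefined name. This is a sketch of that variant; `total`, `spread`, and the lambda are illustrative choices.

```python
import pandas as pd

data = pd.DataFrame({
    "Category": ["A", "B", "A", "B", "A"],
    "Values": [10, 20, 30, 40, 50],
})

# Each keyword becomes an output column; the value is a name or a callable.
stats = data.groupby("Category")["Values"].agg(
    total="sum",
    spread=lambda s: s.max() - s.min(),  # custom aggregation: range per group
)
print(stats)
```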


### Grouping Multiple Columns
You can group by multiple columns by passing a list of column names.

In [28]:
df = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B', 'A', 'B'],
    'Subcategory': ['X', 'Y', 'X', 'Y', 'X', 'X'],
    'Values': [10, 20, 30, 40, 50, 60]
})

grouped = df.groupby(['Category', 'Subcategory']).sum()
print(grouped)


                      Values
Category Subcategory        
A        X                60
         Y                20
B        X                90
         Y                40


### Transform Instead of Aggregation

To keep the original DataFrame shape, use `.transform()` instead of `.agg()`. Each row receives the aggregated value of its group (here, the mean), so the result lines up with the original rows.


In [29]:
df['Mean_Values'] = df.groupby('Category')['Values'].transform('mean')
print(df)


  Category Subcategory  Values  Mean_Values
0        A           X      10    26.666667
1        A           Y      20    26.666667
2        B           X      30    43.333333
3        B           Y      40    43.333333
4        A           X      50    26.666667
5        B           X      60    43.333333
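
Because the transformed result lines up with the original rows, it can be used directly in row-wise arithmetic. A small sketch that computes each row's share of its group total, with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"Category": ["A", "A", "B", "B"], "Values": [10, 30, 20, 60]})

# One group total per original row, aligned on the original index.
group_total = df.groupby("Category")["Values"].transform("sum")
df["Share"] = df["Values"] / group_total
print(df)
```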


### Filtering Groups (filter())

You can keep only the groups that satisfy a condition. Write the condition as a function that returns a boolean, and pass that function to `filter()`.

The function receives each group as a DataFrame and should return `True` to keep the group or `False` to drop it.

In [30]:
def greater_than_50(x):
    return x['Values'].sum() > 50

filtered = df.groupby('Category').filter(greater_than_50)
print(filtered)


  Category Subcategory  Values  Mean_Values
0        A           X      10    26.666667
1        A           Y      20    26.666667
2        B           X      30    43.333333
3        B           Y      40    43.333333
4        A           X      50    26.666667
5        B           X      60    43.333333
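
With the threshold of 50 both groups pass (A sums to 80, B to 130), so nothing is removed. As a sketch, raising the threshold to an arbitrary 100 shows a group actually being dropped:

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["A", "A", "B", "B", "A", "B"],
    "Values": [10, 20, 30, 40, 50, 60],
})

# Group A sums to 80 and is dropped; group B sums to 130 and is kept.
kept = df.groupby("Category").filter(lambda g: g["Values"].sum() > 100)
print(kept)
```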


## Working with CSV

To load CSV data, use the `pd.read_csv()` function; to save a DataFrame as CSV, use `df.to_csv()`.

In [None]:
# e.g.:
# data = pd.read_csv('/path/of/csv/file')

# data.to_csv('/path/to/new/csv')
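
A runnable round trip can be sketched without touching the file system by writing to an in-memory buffer. The use of `io.StringIO` and the names `df_out` and `df_in` are assumptions made for this example; a real file path works the same way.

```python
import io

import pandas as pd

df_out = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})

buffer = io.StringIO()
df_out.to_csv(buffer, index=False)  # index=False avoids writing the row labels
buffer.seek(0)                      # rewind before reading back

df_in = pd.read_csv(buffer)
print(df_in)
```

Without `index=False`, the row labels are written as an extra unnamed column, which then shows up as `Unnamed: 0` when the file is read back.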