# Introduction to PANDAS

In [None]:
import pandas as pd # imports the pandas library and assigns it an alias of pd for convenience.

In [None]:
#read data file
df=pd.read_csv("yob1900.txt")

This code snippet reads a data file named "yob1900.txt" and stores its contents into a pandas DataFrame object called `df`.

```python
# read data file
df = pd.read_csv("yob1900.txt")
```

1. The `pd.read_csv()` function reads the contents of a CSV (comma-separated values) file and creates a DataFrame from it.
2. The filename "yob1900.txt" is passed as an argument to `pd.read_csv()`. Because this is a relative path, the file must be in the current working directory, which is typically the folder containing the notebook or script.
3. The resulting DataFrame is assigned to the variable `df`. By convention, `df` is short for "DataFrame" and is a common name for a pandas DataFrame object.
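The behaviour described above can be tried without the real file. The following is a minimal sketch that assumes a three-line, headerless sample (`sample_text` is made up for illustration and only mimics the layout of "yob1900.txt"). It shows that, by default, `pd.read_csv()` treats the first line as the header.

```python
import io

import pandas as pd

# Hypothetical sample standing in for a headerless file such as "yob1900.txt".
sample_text = "Mary,F,16710\nHelen,F,6343\nAnna,F,6115\n"

# read_csv accepts a path or a file-like object; StringIO avoids touching the disk.
sample_df = pd.read_csv(io.StringIO(sample_text))

print(sample_df.shape)          # (2, 3): only two rows are left as data
print(list(sample_df.columns))  # ['Mary', 'F', '16710']: the first line became the header
```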



In [None]:
df.head() # note the columns are not labelled

Unnamed: 0,Mary,F,16710
0,Helen,F,6343
1,Anna,F,6115
2,Margaret,F,5305
3,Ruth,F,4765
4,Elizabeth,F,4097


This cell calls the `head()` method on the DataFrame `df`. Because the file has no header row, pandas used the first record (Mary, F, 16710) as the column labels.

Breakdown of the code and its structure:

- `df`: This is a variable representing a pandas DataFrame object.

- `.`: The dot operator is used to access attributes and methods of an object. In this case, it is used to access the `head()` method of the DataFrame object.

- `head()` is a method provided by pandas DataFrame objects. It returns the first `n` rows of the DataFrame, where `n` is an optional parameter. By default, `n` is set to 5 if not specified.

The purpose of calling `df.head()` is to retrieve a quick preview of the DataFrame's content, specifically the first few rows. This can be useful for understanding the structure of the data and checking if it is loaded correctly.
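As a small illustration of the optional `n` parameter, here is a sketch on a made-up ten-row DataFrame (`numbers` is a hypothetical name, not part of the notebook's data).

```python
import pandas as pd

# Hypothetical ten-row frame used only to show how n affects head().
numbers = pd.DataFrame({"value": range(10)})

first_five = numbers.head()          # n defaults to 5
first_three = numbers.head(3)        # explicit n
all_but_last_two = numbers.head(-2)  # a negative n drops that many rows from the end

print(len(first_five), len(first_three), len(all_but_last_two))
```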



In [None]:
df=pd.read_csv("yob1900.txt",names=['Name','Gender','Instances']) # adding headers to the dataframe

In [None]:
df.head() # headers added

Unnamed: 0,Name,Gender,Instances
0,Mary,F,16710
1,Helen,F,6343
2,Anna,F,6115
3,Margaret,F,5305
4,Ruth,F,4765
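The effect of the `names` argument can be checked on a tiny in-memory sample. This sketch assumes made-up data (`sample_text`) in the same three-column layout. It also shows `header=None`, which keeps every line as data and numbers the columns instead of naming them.

```python
import io

import pandas as pd

# Hypothetical headerless sample in the same layout as the names file.
sample_text = "Mary,F,16710\nHelen,F,6343\nAnna,F,6115\n"

# names= supplies the column labels, so the first line is kept as data.
labelled = pd.read_csv(io.StringIO(sample_text), names=["Name", "Gender", "Instances"])

# header=None also keeps the first line as data but uses 0, 1, 2 as labels.
unlabelled = pd.read_csv(io.StringIO(sample_text), header=None)

print(labelled.shape)            # (3, 3): no row was lost to the header
print(list(unlabelled.columns))  # [0, 1, 2]
```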


In [None]:
df # the entire dataframe

Unnamed: 0,Name,Gender,Instances
0,Mary,F,16710
1,Helen,F,6343
2,Anna,F,6115
3,Margaret,F,5305
4,Ruth,F,4765
...,...,...,...
3728,White,M,5
3729,Wilhelm,M,5
3730,Winifred,M,5
3731,Woodie,M,5


In [None]:
df.head(10) # specifying the number of rows to be viewed

Unnamed: 0,Name,Gender,Instances
0,Mary,F,16710
1,Helen,F,6343
2,Anna,F,6115
3,Margaret,F,5305
4,Ruth,F,4765
5,Elizabeth,F,4097
6,Florence,F,3920
7,Ethel,F,3896
8,Marie,F,3856
9,Lillian,F,3414


In [None]:
df.dtypes

Name         object
Gender       object
Instances     int64
dtype: object


### Explanation:

- The `dtypes` attribute is used to access the data types of each column in the DataFrame.
- When this line of code is executed, it will return the data types of all the columns in the DataFrame `df`.


Functionality:
The code provides information about the data types of the columns in the DataFrame. The data types represent the type of data stored in each column, such as integer, float, string, or datetime, among others.


Example 1:
Suppose you have a DataFrame `df` with the following columns: 'Name', 'Age', 'Height', and 'Weight'. By executing `df.dtypes`, you may get the data types of each column, such as:
```
Name      object
Age        int64
Height   float64
Weight   float64
dtype: object
```
From the output, you can see that 'Name' is of type `object` (likely a string), 'Age' is of type `int64` (integer), 'Height' and 'Weight' are of type `float64` (floating-point numbers).

Example 2:
Consider a DataFrame `df` with columns 'Date', 'Open', 'High', 'Low', 'Close', and 'Volume'. Running `df.dtypes` might yield the following output:
```
Date       object
Open      float64
High      float64
Low       float64
Close     float64
Volume      int64
dtype: object
```
Here, 'Date' is of type `object` (possibly representing dates as strings), 'Open', 'High', 'Low', and 'Close' are of type `float64` (numerical values with decimal points), and 'Volume' is of type `int64` (integer representing volume).

By examining the data types, you can make informed decisions about how to handle and analyze the data, such as applying appropriate mathematical operations or transformations to the columns.
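One decision that often follows from inspecting `dtypes` is converting columns that were read as text. The sketch below uses a made-up `prices` DataFrame, with values that are assumptions for illustration. It converts a date column with `pd.to_datetime()` and a numeric-looking text column with `astype(float)`.

```python
import pandas as pd

# Hypothetical frame in which both columns were stored as text.
prices = pd.DataFrame({
    "Date": ["2024-01-02", "2024-01-03"],
    "Close": ["10.5", "11.0"],
})
print(prices.dtypes)  # both columns are reported as text before conversion

# Convert each column to a type that supports date and arithmetic operations.
prices["Date"] = pd.to_datetime(prices["Date"])
prices["Close"] = prices["Close"].astype(float)
print(prices.dtypes)

# Arithmetic now works on the numeric column.
print(prices["Close"].sum())
```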

In [None]:
df.info()   #it is very important to understand the characteristics of the dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3733 entries, 0 to 3732
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Name       3733 non-null   object
 1   Gender     3733 non-null   object
 2   Instances  3733 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 87.6+ KB


In [None]:
df["Name"].describe()

count     3733
unique    3397
top       Mary
freq         2
Name: Name, dtype: object


### Code Explanation
The code `df["Name"].describe()` summarizes the column labeled "Name" in the pandas DataFrame `df`.

1. `df["Name"]`: This accesses the column named "Name" within the DataFrame `df`. The expression returns a pandas Series containing the values of that column.

2. `.describe()`: This method computes summary statistics, and the statistics it reports depend on the column's data type. Called on a Series it returns a Series; called on a DataFrame it returns a DataFrame.

### Functionality
Because "Name" holds text (dtype `object`), the summary contains `count` (the number of non-null values), `unique` (the number of distinct values), `top` (the most frequent value), and `freq` (how often that value occurs). For a numeric column, `describe()` instead reports count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum.

### Example and Use Cases
To illustrate the usage and usefulness of this code, consider the following example:

```python
import pandas as pd

# Create a sample DataFrame
data = {
    "Name": ["John", "Alice", "Bob", "Alice", "John", "John", "Bob"],
    "Age": [25, 32, 28, 35, 42, 27, 29],
    "Salary": [50000, 70000, 60000, 80000, 90000, 55000, 65000]
}

df = pd.DataFrame(data)

# Generate descriptive statistics for the "Name" column
name_stats = df["Name"].describe()

print(name_stats)
```

Output:
```
count      7
unique     3
top       John
freq       3
Name: Name, dtype: object
```

In this example, we have a DataFrame `df` with three columns: "Name", "Age", and "Salary". By applying `df["Name"].describe()`, we obtain the descriptive statistics for the "Name" column. The output shows that there are seven entries (count), three unique names (unique), "John" being the most frequent name (top), and it appears three times (freq).

This summary is useful for gaining insight into the distribution of categorical variables: it shows how many distinct values a column has and which value is most common. Comparing `count` with the total number of rows also reveals missing values. Overall, `describe()` provides a concise way to summarize a specific column.
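To make the dependence on data type concrete, the following sketch compares the statistics returned for a text column and a numeric column. The `people` DataFrame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with one text column and one numeric column.
people = pd.DataFrame({
    "Name": ["John", "Alice", "Bob", "Alice"],
    "Age": [25, 32, 28, 35],
})

name_summary = people["Name"].describe()  # text column
age_summary = people["Age"].describe()    # numeric column

print(list(name_summary.index))  # ['count', 'unique', 'top', 'freq']
print(list(age_summary.index))   # ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
```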

In [None]:
df.Name.describe()  #an alternate way to access info about a column

count     3733
unique    3397
top       Mary
freq         2
Name: Name, dtype: object

In [None]:
df["Gender"].describe() # categorical values

count     3733
unique       2
top          F
freq      2226
Name: Gender, dtype: object

In [None]:
df["Instances"].describe() # numerical values

count     3733.000000
mean       120.660863
std        549.703202
min          5.000000
25%          7.000000
50%         12.000000
75%         37.000000
max      16710.000000
Name: Instances, dtype: float64

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3733 entries, 0 to 3732
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Name       3733 non-null   object
 1   Gender     3733 non-null   object
 2   Instances  3733 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 87.6+ KB


### Functionality:
The `df.info()` method is primarily used for gaining insights into the structure and contents of a DataFrame. When executed, it prints a summary of the DataFrame's metadata, such as:

1. The total number of rows and columns in the DataFrame.
2. The names and data types of each column.
3. The count of non-null values for each column.
4. The memory usage of the DataFrame.

The printed summary is useful for understanding the data and identifying potential issues or inconsistencies. It allows you to quickly check the data types, detect missing values, and estimate the memory usage of the DataFrame.
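The missing-value check mentioned above can be sketched on a small made-up DataFrame (`records` is hypothetical). `info()` prints its report rather than returning it, so the sketch captures the text through the `buf` parameter and also counts missing values per column with `isna().sum()`.

```python
import io

import pandas as pd

# Hypothetical frame with one missing name.
records = pd.DataFrame({
    "Name": ["Mary", "Helen", None],
    "Instances": [16710, 6343, 6115],
})

# info() writes to standard output by default; buf= redirects it to a text buffer.
buffer = io.StringIO()
records.info(buf=buffer)
summary_text = buffer.getvalue()
print(summary_text)

# A direct count of missing values per column.
missing_per_column = records.isna().sum()
print(missing_per_column)
```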



In [None]:
Sorted=df.sort_values(['Name'],ascending=True)
Sorted.head()

Unnamed: 0,Name,Gender,Instances
2418,Aaron,M,104
315,Abbie,F,112
1561,Abby,F,7
2499,Abe,M,56
2874,Abel,M,15



### Code Explanation

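This cell sorts the DataFrame by the 'Name' column in ascending order and then displays the first few rows of the sorted result. Note that the rows keep their original index labels (2418, 315, ...), which is why the index no longer counts up from 0.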

Let's break down the code step by step:

1. `Sorted = df.sort_values(['Name'], ascending=True)`: This line of code sorts the DataFrame `df` based on the values in the 'Name' column in ascending order. The `sort_values()` function is used here, and it takes two main parameters:
   - `['Name']`: This specifies the column(s) by which the DataFrame should be sorted. In this case, it's the 'Name' column.
   - `ascending=True`: This parameter determines whether the sorting should be done in ascending or descending order. Here, it's set to `True`, indicating ascending order.

   The sorted DataFrame is assigned to a new variable called `Sorted`.

2. `Sorted.head()`: This line of code displays the first few rows of the sorted DataFrame `Sorted`. By default, the `head()` function displays the first 5 rows. If you want to specify a different number of rows, you can pass an integer value within the parentheses (e.g., `head(10)` to display the first 10 rows).

Use cases for this code could include scenarios where you need to sort a DataFrame based on a specific column, such as sorting a list of students by their names, sorting a list of products by their prices, or sorting a list of cities by their populations. The code provides a convenient way to quickly sort and view the resulting DataFrame.
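A few variations on `sort_values()` are sketched below on a small made-up DataFrame (`names` and its values are assumptions): descending order, sorting by more than one column, and `reset_index(drop=True)` to renumber the rows after sorting.

```python
import pandas as pd

# Hypothetical sample in the same layout as the names data.
names = pd.DataFrame({
    "Name": ["Ruth", "Anna", "Mary", "Anna"],
    "Gender": ["F", "F", "F", "M"],
    "Instances": [4765, 6115, 16710, 5],
})

# Largest counts first.
by_count = names.sort_values("Instances", ascending=False)

# Alphabetical by name; ties broken by count, largest first.
by_name_then_count = names.sort_values(["Name", "Instances"], ascending=[True, False])

# Sorting keeps the original index labels; reset_index renumbers from 0.
renumbered = by_count.reset_index(drop=True)

print(by_count)
print(by_name_then_count)
print(renumbered)
```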

In [None]:
gender=df.groupby('Gender') #Create a groupby object

This line of code performs a groupby operation on a DataFrame called `df` based on the 'Gender' column. The result of this operation is stored in a variable called `gender`, which represents a groupby object.

The groupby operation in pandas is used to split a DataFrame into groups based on a specified column or set of columns. In this case, the 'Gender' column is used as the criterion for grouping.

The `groupby()` method creates a groupby object, an intermediate data structure that holds the grouping but does not compute anything until an operation is applied. It supports operations such as aggregation, transformation, and filtering, each of which is carried out group by group.

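Here are a few examples of how a groupby object such as `gender` can be used. The 'Age', 'Salary', and 'Education' columns below are hypothetical: they are not part of the names dataset loaded above, so these snippets only run on a DataFrame that contains them.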

1. Aggregation: We can calculate summary statistics for each group separately. For instance, we can compute the average age of males and females in the DataFrame by applying the `mean()` function to the `gender` groupby object:

   ```python
   gender['Age'].mean()
   ```

   This will return the average age for each gender group.

2. Transformation: We can perform operations on each group independently and return a new DataFrame with the transformed data. For example, we can normalize the 'Salary' column within each gender group by subtracting the mean salary of the group from each value:

   ```python
   normalized_salary = gender['Salary'].transform(lambda x: x - x.mean())
   ```

   The `transform()` function applies the specified transformation function (in this case, a lambda function) to each group separately and returns a new Series with the transformed values.

3. Filtering: We can filter the data based on some condition within each group. For instance, we can select only the rows where the 'Education' column is 'Bachelor's degree' within each gender group:

   ```python
   filtered_data = gender.apply(lambda x: x[x['Education'] == "Bachelor's degree"])
   ```

   The `apply()` function applies the specified function (in this case, a lambda function) to each group and returns a new DataFrame with the filtered rows.
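The three operations can be tried end to end on a small made-up DataFrame. Everything in this sketch (the `staff` frame, its columns, and its values) is an assumption for illustration. It also uses the groupby `filter()` method, which keeps or drops whole groups, and a plain boolean mask, which is usually the simplest way to select individual rows.

```python
import pandas as pd

# Hypothetical data containing the columns used in the examples above.
staff = pd.DataFrame({
    "Gender": ["F", "M", "F", "M"],
    "Age": [30, 40, 50, 20],
    "Salary": [50000, 60000, 70000, 40000],
    "Education": ["Bachelor's degree", "Master's degree",
                  "Master's degree", "Bachelor's degree"],
})
by_gender = staff.groupby("Gender")

# Aggregation: one value per group.
mean_age = by_gender["Age"].mean()

# Transformation: one value per original row, computed within each group.
centered_salary = by_gender["Salary"].transform(lambda s: s - s.mean())

# Group-level filtering: keep only groups whose mean salary exceeds 55000.
well_paid_groups = by_gender.filter(lambda g: g["Salary"].mean() > 55000)

# Row-level filtering does not need groupby at all.
bachelors = staff[staff["Education"] == "Bachelor's degree"]

print(mean_age)
print(centered_salary)
print(well_paid_groups)
print(bachelors)
```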



In [None]:
print(gender)

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f2404d9b370>


In [None]:
df2=gender.sum(numeric_only=True) # total based on gender category; numeric_only skips the text 'Name' column
df2

Unnamed: 0_level_0,Instances
Gender,Unnamed: 1_level_1
F,299873
M,150554


In [None]:
df3=df.groupby('Gender').sum(numeric_only=True) # alternatively, group and sum in a single expression
df3

Unnamed: 0_level_0,Instances
Gender,Unnamed: 1_level_1
F,299873
M,150554


In [None]:
df.groupby('Gender').head()

Unnamed: 0,Name,Gender,Instances
0,Mary,F,16710
1,Helen,F,6343
2,Anna,F,6115
3,Margaret,F,5305
4,Ruth,F,4765
2226,John,M,9834
2227,William,M,8580
2228,James,M,7246
2229,George,M,5405
2230,Charles,M,4102


In [None]:
import pandas as pd
import random

# read the data from the downloaded CSV file.
data = pd.read_csv('https://s3-eu-west-1.amazonaws.com/shanebucket/downloads/uk-500.csv')

# Set a numeric id for use as an index for examples.
# The list comprehension draws one random integer between 0 and 1000 (inclusive)
# for each row, and the resulting list is assigned as a new column named 'id'.
# The ids are random, so they may repeat and will differ from run to run.
data['id'] = [random.randint(0,1000) for x in range(data.shape[0])]

data.head(5)

Unnamed: 0,first_name,last_name,company_name,address,city,county,postal,phone1,phone2,email,web,id
0,Aleshia,Tomkiewicz,Alan D Rosenburg Cpa Pc,14 Taylor St,St. Stephens Ward,Kent,CT2 7PP,01835-703597,01944-369967,atomkiewicz@hotmail.com,http://www.alandrosenburgcpapc.co.uk,117
1,Evan,Zigomalas,Cap Gemini America,5 Binney St,Abbey Ward,Buckinghamshire,HP11 2AX,01937-864715,01714-737668,evan.zigomalas@gmail.com,http://www.capgeminiamerica.co.uk,46
2,France,Andrade,"Elliott, John W Esq",8 Moor Place,East Southbourne and Tuckton W,Bournemouth,BH6 3BE,01347-368222,01935-821636,france.andrade@hotmail.com,http://www.elliottjohnwesq.co.uk,390
3,Ulysses,Mcwalters,"Mcmahan, Ben L",505 Exeter Rd,Hawerby cum Beesby,Lincolnshire,DN36 5RP,01912-771311,01302-601380,ulysses@hotmail.com,http://www.mcmahanbenl.co.uk,130
4,Tyisha,Veness,Champagne Room,5396 Forth Street,Greets Green and Lyng Ward,West Midlands,B70 9DT,01547-429341,01290-367248,tyisha.veness@hotmail.com,http://www.champagneroom.co.uk,232


The `.iloc` indexer selects rows and columns by integer position. It takes two “arguments”: a row selector and a column selector. For example:

In [None]:
# Rows:
data.iloc[0] # first row of data frame (Aleshia Tomkiewicz) - Note a Series data type output.
data.iloc[1] # second row of data frame (Evan Zigomalas)
data.iloc[-1] # last row of data frame (Mi Richan)


# Columns:
data.iloc[:,0] # first column of data frame (first_name)
data.iloc[:,1] # second column of data frame (last_name)
data.iloc[:,-1] # last column of data frame (id)

0      117
1       46
2      390
3      130
4      232
      ... 
495    941
496    243
497    589
498    721
499    452
Name: id, Length: 500, dtype: int64

Multiple columns and rows can be selected together using the .iloc indexer.

In [None]:
# Multiple row and column selections using iloc and DataFrame
data.iloc[0:5] # first five rows of dataframe


Unnamed: 0,first_name,last_name,company_name,address,city,county,postal,phone1,phone2,email,web,id
0,Aleshia,Tomkiewicz,Alan D Rosenburg Cpa Pc,14 Taylor St,St. Stephens Ward,Kent,CT2 7PP,01835-703597,01944-369967,atomkiewicz@hotmail.com,http://www.alandrosenburgcpapc.co.uk,117
1,Evan,Zigomalas,Cap Gemini America,5 Binney St,Abbey Ward,Buckinghamshire,HP11 2AX,01937-864715,01714-737668,evan.zigomalas@gmail.com,http://www.capgeminiamerica.co.uk,46
2,France,Andrade,"Elliott, John W Esq",8 Moor Place,East Southbourne and Tuckton W,Bournemouth,BH6 3BE,01347-368222,01935-821636,france.andrade@hotmail.com,http://www.elliottjohnwesq.co.uk,390
3,Ulysses,Mcwalters,"Mcmahan, Ben L",505 Exeter Rd,Hawerby cum Beesby,Lincolnshire,DN36 5RP,01912-771311,01302-601380,ulysses@hotmail.com,http://www.mcmahanbenl.co.uk,130
4,Tyisha,Veness,Champagne Room,5396 Forth Street,Greets Green and Lyng Ward,West Midlands,B70 9DT,01547-429341,01290-367248,tyisha.veness@hotmail.com,http://www.champagneroom.co.uk,232


In [None]:
# you generate the output

data.iloc[:, 0:2] # first two columns of data frame with all rows
data.iloc[[0,3,6,24], [0,5,6]] # 1st, 4th, 7th, 25th row + 1st 6th 7th columns.
data.iloc[0:5, 5:8] # first 5 rows and 6th, 7th, 8th columns of data frame (county -> phone1).
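The positional rules used in the cell above are easier to verify on a tiny table. The sketch below uses a made-up 3x4 DataFrame (`grid` is hypothetical). Slices exclude their end position, and a single integer returns a Series while a one-element list returns a DataFrame.

```python
import pandas as pd

# Hypothetical 3x4 frame whose values make positions easy to read off.
grid = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
    columns=["a", "b", "c", "d"],
)

first_row = grid.iloc[0]              # a single integer gives a Series
first_row_frame = grid.iloc[[0]]      # a list gives a DataFrame
block = grid.iloc[0:2, 1:3]           # rows 0-1, columns at positions 1-2 (end excluded)
picked = grid.iloc[[0, 2], [0, 3]]    # explicit row and column positions

print(type(first_row).__name__, type(first_row_frame).__name__)
print(block)
print(picked)
```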


Label-based indexing using .loc

Selections using the `.loc` indexer are based on the index labels of the data frame. Once an index has been set with `df.set_index()`, `.loc` selects rows directly by their index values. For example, setting the index of our test data frame to the person's “last_name”:

In [None]:
data.set_index("last_name", inplace=True)
data.head()

Unnamed: 0_level_0,first_name,company_name,address,city,county,postal,phone1,phone2,email,web,id
last_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Tomkiewicz,Aleshia,Alan D Rosenburg Cpa Pc,14 Taylor St,St. Stephens Ward,Kent,CT2 7PP,01835-703597,01944-369967,atomkiewicz@hotmail.com,http://www.alandrosenburgcpapc.co.uk,117
Zigomalas,Evan,Cap Gemini America,5 Binney St,Abbey Ward,Buckinghamshire,HP11 2AX,01937-864715,01714-737668,evan.zigomalas@gmail.com,http://www.capgeminiamerica.co.uk,46
Andrade,France,"Elliott, John W Esq",8 Moor Place,East Southbourne and Tuckton W,Bournemouth,BH6 3BE,01347-368222,01935-821636,france.andrade@hotmail.com,http://www.elliottjohnwesq.co.uk,390
Mcwalters,Ulysses,"Mcmahan, Ben L",505 Exeter Rd,Hawerby cum Beesby,Lincolnshire,DN36 5RP,01912-771311,01302-601380,ulysses@hotmail.com,http://www.mcmahanbenl.co.uk,130
Veness,Tyisha,Champagne Room,5396 Forth Street,Greets Green and Lyng Ward,West Midlands,B70 9DT,01547-429341,01290-367248,tyisha.veness@hotmail.com,http://www.champagneroom.co.uk,232


Now with the index set, we can directly select rows for different “last_name” values using .loc[<label>]  – either singly, or in multiples. For example:

In [None]:
data.loc[["Andrade", 'Veness']]

Unnamed: 0_level_0,first_name,company_name,address,city,county,postal,phone1,phone2,email,web,id
last_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Andrade,France,"Elliott, John W Esq",8 Moor Place,East Southbourne and Tuckton W,Bournemouth,BH6 3BE,01347-368222,01935-821636,france.andrade@hotmail.com,http://www.elliottjohnwesq.co.uk,390
Veness,Tyisha,Champagne Room,5396 Forth Street,Greets Green and Lyng Ward,West Midlands,B70 9DT,01547-429341,01290-367248,tyisha.veness@hotmail.com,http://www.champagneroom.co.uk,232


Select columns with .loc using the names of the columns.

In [None]:
data.loc[["Andrade", 'Veness'], ['first_name', 'address', 'city']]

Unnamed: 0_level_0,first_name,address,city
last_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Andrade,France,8 Moor Place,East Southbourne and Tuckton W
Veness,Tyisha,5396 Forth Street,Greets Green and Lyng Ward
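The label-based selections above can be reproduced on a small made-up DataFrame. In this sketch the `people` frame and its values are assumptions. It also shows two details worth knowing: a label slice includes both endpoints, unlike the positional slices used with `.iloc`, and `.loc` accepts a boolean mask as the row selector.

```python
import pandas as pd

# Hypothetical three-row frame indexed by last name.
people = pd.DataFrame({
    "last_name": ["Andrade", "Mcwalters", "Veness"],
    "first_name": ["France", "Ulysses", "Tyisha"],
    "city": ["Town A", "Town B", "Town C"],
}).set_index("last_name")

one_row = people.loc["Andrade"]                                # single label gives a Series
label_slice = people.loc["Andrade":"Mcwalters", "first_name"]  # both endpoints are included
masked = people.loc[people["first_name"] == "Tyisha", ["city"]]  # boolean mask for rows

print(one_row)
print(label_slice)
print(masked)
```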
