# Data Analysis using Pandas

## Questions

> What is the relationship between NumPy arrays, Pandas Series, and Pandas DataFrames?
> What makes indexes in Pandas different from the rest of the data?
> How do you access rows, columns, and values in a DataFrame using index and using boolean conditions?
> How do you get descriptive statistics, and how can you filter ("slice") those statistics by row, column, or value?

### What is Pandas?

> 'Pandas' is the Python Data Analysis Library. It is used to load, clean, reshape, and analyze tabular data, and it can read from and write to many file formats, including Excel.

### DataFrame vs. Series:
> A **DataFrame** is like a 2D array but with labeled rows and columns, making it more versatile. Each column can hold different types of data, unlike NumPy arrays. When you select a single column from a DataFrame, it returns a one-dimensional **Series**, which contains only one type of data. 

In [3]:
import pandas as pd

In [4]:
data = [["Mark", 55, "Italy", 4.5, "Europe"],
        ["John", 33, "USA", 6.7, "America"],
        ["Tim", 41, "USA", 3.9, "America"],
        ["Jenny", 12, "Germany", 9.0, "Europe"]]

df = pd.DataFrame(data=data,
                  columns=["name", "age", "country", "score", "continent"],
                  index=[1001, 1000, 1002, 1003])
df

,name,age,country,score,continent
1001,Mark,55,Italy,4.5,Europe
1000,John,33,USA,6.7,America
1002,Tim,41,USA,3.9,America
1003,Jenny,12,Germany,9.0,Europe
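
> As a minimal sketch of the DataFrame/Series distinction described above, the cell below rebuilds the same sample data and selects the "age" column in two ways. The variable names `ages` and `ages_frame` are illustrative only.

```python
import pandas as pd

# Same sample data as in the cell above.
data = [["Mark", 55, "Italy", 4.5, "Europe"],
        ["John", 33, "USA", 6.7, "America"],
        ["Tim", 41, "USA", 3.9, "America"],
        ["Jenny", 12, "Germany", 9.0, "Europe"]]
df = pd.DataFrame(data=data,
                  columns=["name", "age", "country", "score", "continent"],
                  index=[1001, 1000, 1002, 1003])

ages = df["age"]          # a single label selects one column as a Series
ages_frame = df[["age"]]  # a list of labels selects a (one-column) DataFrame

print(type(ages).__name__)        # Series
print(type(ages_frame).__name__)  # DataFrame
print(ages.index.tolist())        # the Series keeps the row labels of df
```

> Both results carry the same row labels as `df`; only the dimensionality differs.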


> The 'df.info()' method gives a summary of the DataFrame, including the total number of entries, data types, and memory usage. This is a quick way to understand the structure of your DataFrame.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, 1001 to 1003
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   name       4 non-null      object 
 1   age        4 non-null      int64  
 2   country    4 non-null      object 
 3   score      4 non-null      float64
 4   continent  4 non-null      object 
dtypes: float64(1), int64(1), object(3)
memory usage: 192.0+ bytes
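
> The individual pieces of the summary above are also available as attributes. The sketch below uses a small stand-in DataFrame (not the one from the notebook) to show `shape`, `dtypes`, and `select_dtypes`, which can be handy when you only want the numeric columns.

```python
import pandas as pd

# A small stand-in DataFrame, assumed only for this illustration.
df = pd.DataFrame({"name": ["Mark", "John"],
                   "age": [55, 33],
                   "score": [4.5, 6.7]})

print(df.shape)   # (rows, columns) -> (2, 3)
print(df.dtypes)  # one dtype per column

# Keep only the numeric columns.
numeric = df.select_dtypes(include="number")
print(list(numeric.columns))  # ['age', 'score']
```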


> In Pandas, the row labels are called the index. The index usually serves as an identifier for each row, although Pandas does not require the labels to be unique. By default, Pandas uses integers starting from 0, but you can supply your own labels and give the index a name.

In [6]:
df.index                   # the current row labels
df.index.name = "user_id"  # give the index a name
df

user_id,name,age,country,score,continent
1001,Mark,55,Italy,4.5,Europe
1000,John,33,USA,6.7,America
1002,Tim,41,USA,3.9,America
1003,Jenny,12,Germany,9.0,Europe
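
> To illustrate how the index is used for lookups, here is a short sketch with a reduced version of the sample data. It contrasts label-based access (`loc`) with position-based access (`iloc`) and checks whether the labels are unique. The `dup` DataFrame is a made-up example of a non-unique index.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Mark", "John", "Tim"], "age": [55, 33, 41]},
                  index=[1001, 1000, 1002])
df.index.name = "user_id"

row = df.loc[1000]   # select by label: the row whose user_id is 1000
first = df.iloc[0]   # select by position: the first row, whatever its label

print(row["name"], first["name"])  # John Mark
print(df.index.is_unique)          # True

# Index labels may repeat; a label lookup then returns every matching row.
dup = pd.DataFrame({"name": ["Ann", "Bob"]}, index=[1, 1])
print(dup.index.is_unique)  # False
print(len(dup.loc[1]))      # 2
```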


> You can reset the index (to default integers) or set a new column as the index. For example:

In [12]:
df.reset_index()
df.reset_index().set_index("name")

name,user_id,age,country,score,continent
Mark,1001,55,Italy,4.5,Europe
John,1000,33,USA,6.7,America
Tim,1002,41,USA,3.9,America
Jenny,1003,12,Germany,9.0,Europe
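
> Note that `reset_index()` and `set_index()` return new DataFrames and leave the original untouched. The sketch below, using a reduced version of the sample data and a hypothetical variable `by_name`, makes that visible.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Mark", "John"], "age": [55, 33]},
                  index=[1001, 1000])
df.index.name = "user_id"

# Chain the two calls; the result is a new object.
by_name = df.reset_index().set_index("name")

print(df.index.name)          # user_id (the original is unchanged)
print(by_name.index.name)     # name
print(list(by_name.columns))  # ['user_id', 'age']
```

> To keep the change, assign the result back, for example `df = df.reset_index().set_index("name")`.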


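> To reorder or select rows by label, you can reindex with a new list of index values: labels that exist keep their data, labels that do not exist produce rows of missing values (NaN), and existing labels left out of the list are dropped. Sorting can be done either by row index or by the values in one or more columns. Only the result of the last line is displayed: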

In [14]:
df.reindex([999, 1000, 1001, 1004])   # 999 and 1004 do not exist: NaN rows
df.sort_index()                       # sort by row labels
df.sort_values(["continent", "age"])  # sort by continent, then by age

properties,name,age,country,score,continent
user_id,,,,,
1000,John,33,USA,6.7,America
1002,Tim,41,USA,3.9,America
1003,Jenny,12,Germany,9.0,Europe
1001,Mark,55,Italy,4.5,Europe
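
> Since the cell above only displays the sorted result, the following sketch shows what `reindex` does with labels that are missing from the original index. It uses a reduced version of the sample data; `reindexed` and `sorted_desc` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Mark", "John", "Tim", "Jenny"],
                   "age": [55, 33, 41, 12]},
                  index=[1001, 1000, 1002, 1003])

# 999 and 1004 are not in the index, so their rows are filled with NaN.
# 1002 and 1003 are not requested, so they are dropped.
reindexed = df.reindex([999, 1000, 1001, 1004])
print(reindexed)
print(reindexed["age"].isna().tolist())  # [True, False, False, True]

# Sorting in descending order uses the ascending parameter.
sorted_desc = df.sort_values("age", ascending=False)
print(sorted_desc.index.tolist())  # [1001, 1002, 1000, 1003]
```

> Because NaN is a floating-point value, the integer "age" column becomes a float column after reindexing with missing labels.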


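> You can retrieve the column labels of a DataFrame using df.columns. Like the row index, the column index can itself be given a name, which is shown in the top-left corner of the displayed table. This does not change the column labels: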

In [13]:
df.columns
df.columns.name = "properties"

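> You can rename individual columns using df.rename() with a dictionary that maps old labels to new ones. This is useful for making your DataFrame more readable. The method returns a new DataFrame and leaves df unchanged: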

In [15]:
df.rename(columns={"name": "First Name", "age": "Age"})

properties,First Name,Age,country,score,continent
user_id,,,,,
1001,Mark,55,Italy,4.5,Europe
1000,John,33,USA,6.7,America
1002,Tim,41,USA,3.9,America
1003,Jenny,12,Germany,9.0,Europe
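
> The difference between naming the column index and renaming columns can be checked with a small sketch. It uses a reduced version of the sample data, and `renamed` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Mark", "John"], "age": [55, 33]})

# Naming the column index does not touch the column labels.
df.columns.name = "properties"
print(list(df.columns))  # ['name', 'age']

# rename() returns a new DataFrame with the mapped labels.
renamed = df.rename(columns={"name": "First Name", "age": "Age"})
print(list(renamed.columns))  # ['First Name', 'Age']
print(list(df.columns))       # ['name', 'age'] (original unchanged)
```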


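> You can load data from an Excel file directly into a DataFrame using pd.read_excel(). For .xlsx files, Pandas relies on an optional engine such as openpyxl, which must be installed separately: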

In [None]:
import pandas as pd
pd.read_excel("../DogAccessories.xlsx")