# Pandas

Pandas, like NumPy, is one of the most popular Python libraries for data analysis. It is a high-level abstraction built on top of NumPy, whose performance-critical core is written largely in C.

Pandas provides high-performance, easy-to-use data structures and data analysis tools. It has two main data structures: data frames and series.

* A pandas series is similar to a list, but it associates a label with each element, which also makes it behave somewhat like a dictionary.
* If an index is not explicitly provided by the user, pandas creates a RangeIndex ranging from 0 to N-1.
* Each series object also has a data type.



In [1]:
import pandas as pd

new_series = pd.Series([5,6,7,8,9,19])
print(new_series)

0     5
1     6
2     7
3     8
4     9
5    19
dtype: int64


In [2]:
import pandas as pd

new_series = pd.Series([5,6,7,8,9,19])
print(new_series.values)

print('-----------------')
print(new_series[4])

[ 5  6  7  8  9 19]
-----------------
9


You can also provide an index manually:

In [3]:
import pandas as pd

new_series = pd.Series([5,6,7,8,9,19], index=['a','b','c','d','e','f'])
print(new_series.values)

print('-----------------')
print(new_series['f'])

[ 5  6  7  8  9 19]
-----------------
19


It is easy to retrieve several elements of a series by their index labels, or to assign a value to several elements at once.

In [4]:
import pandas as pd

new_series = pd.Series([5,6,7,8,9,19], index=['a','b','c','d','e','f'])
print(new_series.values)

print('-----------------')
new_series[['a','b','f']] = 0
print(new_series)

[ 5  6  7  8  9 19]
-----------------
a    0
b    0
c    7
d    8
e    9
f    0
dtype: int64
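
The cell above demonstrates group assignment; retrieving several elements works the same way. A minimal sketch, reusing the same series (the name `subset` is only illustrative):

```python
import pandas as pd

new_series = pd.Series([5, 6, 7, 8, 9, 19], index=['a', 'b', 'c', 'd', 'e', 'f'])

# Passing a list of labels returns a new series holding just those elements,
# in the order the labels were requested
subset = new_series[['a', 'b', 'f']]
print(subset)
```

The double brackets matter: the outer pair is the indexing operator and the inner pair is the list of labels.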


# Filtering and maths operations

In [5]:
import pandas as pd

new_series = pd.Series([5,6,7,8,9,19], index=['a','b','c','d','e','f'])
new_series2 = new_series[new_series > 6]
print(new_series2)
print('-----------------')
new_series2 = new_series[new_series >= 6]*2
print(new_series2)

c     7
d     8
e     9
f    19
dtype: int64
-----------------
b    12
c    14
d    16
e    18
f    38
dtype: int64
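
Beyond multiplication, other arithmetic and the common aggregations follow the same pattern. The following is a small sketch on the same series; the variable names `shifted`, `total` and `mean` are chosen here for illustration only:

```python
import pandas as pd

new_series = pd.Series([5, 6, 7, 8, 9, 19], index=['a', 'b', 'c', 'd', 'e', 'f'])

# Arithmetic is applied element by element and the index labels are kept
shifted = new_series + 1
print(shifted)

# Aggregations reduce the whole series to a single value
total = new_series.sum()
mean = new_series.mean()
print(total, mean)
```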


# Pandas data frame

A data frame is a table, with rows and columns.
* Each column in a data frame is a series object.
* Each row consists of the elements that share the same index label across those columns.

![image.png](attachment:image.png)

# Creating a Pandas data frame

Pandas data frames can be constructed from Python dictionaries: each key becomes a column name, and the list stored under that key becomes the column's values.


In [6]:
import pandas as pd

# Define the data
data = {
    'country': ['Vietnam', 'Cambodia', 'Laos'],
    'population': [97338579, 16718971, 7275556],  # Population values
    'square': [331212, 181035, 236800]  # Square kilometer values
}

# Create the DataFrame
df = pd.DataFrame(data)

# Print the DataFrame
print(df)

    country  population  square
0   Vietnam    97338579  331212
1  Cambodia    16718971  181035
2      Laos     7275556  236800


You can check the type of a column with the built-in type() function:

In [7]:
print(type(df['country']))

<class 'pandas.core.series.Series'>


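A data frame can also be built from a list of lists, where each inner list is one row. The columns are then numbered from 0 and can be renamed through the columns attribute:

In [8]: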
import pandas as pd

list2 = [[0,1,2],[3,4,5],[6,7,8]]

df = pd.DataFrame(list2)
print(df)

df.columns = ['V1', 'V2', 'V3']
print(df)

   0  1  2
0  0  1  2
1  3  4  5
2  6  7  8
   V1  V2  V3
0   0   1   2
1   3   4   5
2   6   7   8
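
As an alternative to assigning `df.columns` afterwards, the column names can be passed when the data frame is created. A minimal sketch using the same nested list:

```python
import pandas as pd

list2 = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]

# The columns argument names the columns at construction time
df = pd.DataFrame(list2, columns=['V1', 'V2', 'V3'])
print(df)
```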


A pandas data frame object has two indices: a column index and a row index. If you do not provide a row index, pandas creates a RangeIndex from 0 to N-1.

In [9]:
import pandas as pd

# Define the data
data = {
    'country': ['Vietnam', 'Cambodia', 'Laos'],
    'population': [97338579, 16718971, 7275556],  # Population values
    'square': [331212, 181035, 236800]  # Square kilometer values
}

# Create the DataFrame
df = pd.DataFrame(data)

print(df.columns)
print(df.index)

Index(['country', 'population', 'square'], dtype='object')
RangeIndex(start=0, stop=3, step=1)


# Indexing

There are numerous ways to provide row indices explicitly. For example, you could provide an index when creating a data frame:

In [10]:
import pandas as pd

# Define the data
data = {
    'country': ['Vietnam', 'Cambodia', 'Laos'],
    'population': [97338579, 16718971, 7275556],  # Population values
    'square': [331212, 181035, 236800]  # Square kilometer values
}

# Create the DataFrame with custom index
df = pd.DataFrame(data, index=['VN', 'KH', 'LA'])

print(df)
print(df.columns)
print(df.index)


     country  population  square
VN   Vietnam    97338579  331212
KH  Cambodia    16718971  181035
LA      Laos     7275556  236800
Index(['country', 'population', 'square'], dtype='object')
Index(['VN', 'KH', 'LA'], dtype='object')
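
Another way to provide row labels is to promote an existing column to the index with `set_index()`. The sketch below assumes a hypothetical extra `code` column that is not part of the original data:

```python
import pandas as pd

data = {
    'code': ['VN', 'KH', 'LA'],  # hypothetical column, added for this example
    'country': ['Vietnam', 'Cambodia', 'Laos'],
    'population': [97338579, 16718971, 7275556],
    'square': [331212, 181035, 236800],
}

df = pd.DataFrame(data)

# set_index returns a new data frame; df itself keeps its RangeIndex
df_by_code = df.set_index('code')
print(df_by_code)
print(df_by_code.index)
```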


loc and iloc are the two main indexers for accessing data in a DataFrame: loc selects by label, while iloc selects by integer position.

* loc
    - Purpose: loc is used for label-based indexing. It allows you to select rows and columns by their labels (names).
    - Usage: loc[row_label, column_label]

In [11]:
# Using loc to access data
print(df.loc['VN'])  # Access the row with index 'VN'

country        Vietnam
population    97338579
square          331212
Name: VN, dtype: object


In [12]:
print(df.loc['VN', 'population'])  # Access the 'population' column for the row with index 'VN'

97338579


In [13]:
print(df.loc[:, 'country'])  # Access all rows for the 'country' column

VN     Vietnam
KH    Cambodia
LA        Laos
Name: country, dtype: object


In [14]:
print(df.loc['VN':'KH', 'country':'population'])  # Access a range of rows and columns by label (both end labels are included)

     country  population
VN   Vietnam    97338579
KH  Cambodia    16718971


* iloc
  - Purpose: iloc is used for positional indexing. It allows you to select rows and columns by their integer positions.
  - Usage: iloc[row_index, column_index]

Example:

In [15]:
print(df.iloc[0])  # Access the first row (index position 0)

country        Vietnam
population    97338579
square          331212
Name: VN, dtype: object


In [16]:
print(df.iloc[0, 1])  # Access the first row, second column (index position [0, 1])

97338579


In [17]:
print(df.iloc[:, 1])  # Access all rows for the second column (index position 1)

VN    97338579
KH    16718971
LA     7275556
Name: population, dtype: int64


In [18]:
print(df.iloc[0:2, 0:2])  # Access a range of rows and columns by position (the end position is excluded)

     country  population
VN   Vietnam    97338579
KH  Cambodia    16718971
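
One difference between the two indexers is easy to miss: a `loc` slice includes its end label, whereas an `iloc` slice excludes its end position, like ordinary Python slicing. A short sketch on the same data frame (the names `by_label` and `by_position` are illustrative):

```python
import pandas as pd

data = {
    'country': ['Vietnam', 'Cambodia', 'Laos'],
    'population': [97338579, 16718971, 7275556],
    'square': [331212, 181035, 236800],
}
df = pd.DataFrame(data, index=['VN', 'KH', 'LA'])

# Label slice: the row labelled 'KH' is part of the result
by_label = df.loc['VN':'KH']

# Position slice: position 1 is NOT part of the result
by_position = df.iloc[0:1]

print(len(by_label), len(by_position))
```

To get the same two rows by position, the slice has to be `df.iloc[0:2]`.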


## Filtering

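Filtering is performed with Boolean arrays: a condition such as df.square > 200000 produces a series of True/False values, and indexing the data frame with that series keeps only the rows marked True.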


In [19]:
print(df[df.square > 200000][['country', 'population']])

    country  population
VN  Vietnam    97338579
LA     Laos     7275556
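
Several conditions can be combined with `&` (and), `|` (or) and `~` (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A sketch on the same data frame, with an arbitrary population threshold chosen for illustration:

```python
import pandas as pd

data = {
    'country': ['Vietnam', 'Cambodia', 'Laos'],
    'population': [97338579, 16718971, 7275556],
    'square': [331212, 181035, 236800],
}
df = pd.DataFrame(data, index=['VN', 'KH', 'LA'])

# The condition itself is a Boolean series aligned with the row index
mask = df['square'] > 200000
print(mask)

# Rows that satisfy both conditions
both = df[(df['square'] > 200000) & (df['population'] > 10_000_000)]

# Rows that satisfy at least one condition
either = df[(df['square'] > 200000) | (df['population'] > 10_000_000)]

# Rows that do not satisfy the condition
not_large = df[~mask]

print(both)
```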


# Deleting columns

You can delete a column using the drop() method. It returns a new data frame and leaves the original unchanged.


In [20]:
print(df)
df2 = df.drop(['population'], axis='columns')
print(df2)

     country  population  square
VN   Vietnam    97338579  331212
KH  Cambodia    16718971  181035
LA      Laos     7275556  236800
     country  square
VN   Vietnam  331212
KH  Cambodia  181035
LA      Laos  236800
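
The same method can drop rows, and the `columns=` and `index=` keywords are an alternative to the `axis` argument. A minimal sketch, assuming the data frame from the cells above:

```python
import pandas as pd

data = {
    'country': ['Vietnam', 'Cambodia', 'Laos'],
    'population': [97338579, 16718971, 7275556],
    'square': [331212, 181035, 236800],
}
df = pd.DataFrame(data, index=['VN', 'KH', 'LA'])

# Equivalent to df.drop(['population'], axis='columns')
df_no_population = df.drop(columns='population')

# Drop a row by its index label
df_no_kh = df.drop(index='KH')

print(df_no_kh)
print(df.shape)  # the original data frame is unchanged
```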


# Reading from and writing to a file

Pandas supports many popular file formats, including CSV, XML, HTML, Excel, SQL, and JSON.
Of these, CSV is the format you will probably work with the most.
You can read data from a CSV file with the read_csv() function, which accepts either a local path or a URL.

In [21]:
covid19 = pd.read_csv('https://raw.githubusercontent.com/huynhhoc/DataAnalystDeepLearning/main/Data/covid19/countriessample.csv')


Similarly, you can write a data frame to a CSV file with the to_csv() method. By default, the row index is written as the first column.

In [22]:
covid19.to_csv('covid19_copy.csv')
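
The effect of the index on the written file can be seen without touching the disk. The sketch below uses a small stand-in data frame rather than the downloaded data, and relies on `to_csv()` returning the CSV text as a string when no path is given:

```python
import io
import pandas as pd

# Small stand-in data frame (not the covid19 data set)
df = pd.DataFrame({
    'country': ['Vietnam', 'Cambodia', 'Laos'],
    'population': [97338579, 16718971, 7275556],
})

# Default: the row index becomes an unnamed first column
with_index = df.to_csv()
print(with_index.splitlines()[0])

# index=False writes only the data columns
without_index = df.to_csv(index=False)
print(without_index.splitlines()[0])

# Reading the text back gives the original columns and values
restored = pd.read_csv(io.StringIO(without_index))
print(restored)
```

Passing `index=False` is usually what you want when the index is just the default RangeIndex, since otherwise reading the file back adds an extra unnamed column.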

# pandas Questions

1. What is the primary purpose of pandas in Python?

A) Web development
B) Data manipulation and analysis
C) Numerical computing
D) Text processing
Answer: B

2. What is the conventional way to import pandas in a Python script?

A) import pd
B) import pandas as pd
C) import pandas
D) import pd as pandas
Answer: B

3. Which function reads a CSV file into a DataFrame?

A) pd.read_file()
B) pd.read_csv()
C) pd.load_csv()
D) pd.open_csv()
Answer: B

4. How do you display the first 5 rows of a DataFrame df?

A) df.head(5)
B) df.head()
C) df.tail(5)
D) df.show(5)
Answer: B (A gives the same result, since head() returns 5 rows by default)

5. How do you select a column named "age" from DataFrame df?

A) df.age
B) df['age']
C) df.loc[:, 'age']
D) All of the above
Answer: D

6. Which method is used to remove missing values from a DataFrame?

A) df.remove()
B) df.dropna()
C) df.drop()
D) df.clean()
Answer: B

7. What does df.describe() provide?

A) Summary statistics of the DataFrame
B) The shape of the DataFrame
C) The column names of the DataFrame
D) The data types of the DataFrame
Answer: A

8. How do you group DataFrame df by a column named "gender"?

A) df.groupby('gender')
B) df.split('gender')
C) df.divide('gender')
D) df.group('gender')
Answer: A

9. How do you sort DataFrame df by a column named "age"?

A) df.sortby('age')
B) df.orderby('age')
C) df.sort_values('age')
D) df.sort('age')
Answer: C

10. What is the output of df.shape?

A) The data types of the DataFrame
B) The number of rows and columns in the DataFrame
C) The summary statistics of the DataFrame
D) The first few rows of the DataFrame
Answer: B

11. How do you rename a column in DataFrame df from 'old_name' to 'new_name'?

A) df.rename(columns={'old_name': 'new_name'})
B) df.columns['old_name'] = 'new_name'
C) df.change_column('old_name', 'new_name')
D) df.rename_column('old_name', 'new_name')
Answer: A

12. Which method is used to fill missing values in a DataFrame?

A) df.fillna()
B) df.dropna()
C) df.replace_na()
D) df.fill()
Answer: A

13. How do you get the column names of DataFrame df?

A) df.columns()
B) df.columns
C) df.get_columns()
D) df.column_names()
Answer: B

14. How do you filter rows in DataFrame df where the column 'age' is greater than 30?

A) df[df['age'] > 30]
B) df.filter('age' > 30)
C) df[df.age > 30]
D) Both A and C
Answer: D

15. What does df.isnull() return?

A) Rows with null values
B) A DataFrame of the same shape with boolean values indicating nulls
C) A DataFrame without null values
D) The number of null values in df
Answer: B

16. Which method concatenates DataFrames df1 and df2 vertically?

A) pd.concat([df1, df2])
B) pd.append([df1, df2])
C) pd.merge([df1, df2])
D) pd.join([df1, df2])
Answer: A

17. How do you reset the index of DataFrame df?

A) df.reset()
B) df.reset_index()
C) df.index.reset()
D) df.reindex()
Answer: B

18. How do you drop a column named 'age' in DataFrame df?

A) df.drop_column('age')
B) df.drop(columns='age')
C) df.drop('age', axis=1)
D) Both B and C
Answer: D

19. What does df.apply(np.sum) do?

A) Applies the sum function to each element in df
B) Applies the sum function along each column in df
C) Applies the sum function along each row in df
D) Both B and C
Answer: B

20. Which function merges two DataFrames df1 and df2 on a common column 'key'?

A) pd.concat([df1, df2], on='key')
B) pd.merge([df1, df2], on='key')
C) df1.merge(df2, on='key')
D) Both B and C
Answer: C
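
Several of the answers above can be checked on a small example. The sketch below uses made-up data; the column names follow the questions, while the values and variable names are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Made-up data with one missing age
df = pd.DataFrame({
    'gender': ['F', 'M', 'F', 'M'],
    'age': [34, 28, np.nan, 45],
})

print(df.shape)                  # question 10: (rows, columns)
print(df.isnull())               # question 15: Boolean frame of the same shape

cleaned = df.dropna()            # question 6: rows with missing values removed
filled = df.fillna(0)            # question 12: missing values replaced by 0
older = df[df['age'] > 30]       # question 14: Boolean filtering
by_age = df.sort_values('age')   # question 9: missing values are placed last
mean_age = df.groupby('gender')['age'].mean()   # question 8
renamed = df.rename(columns={'age': 'years'})   # question 11: returns a new frame
stacked = pd.concat([df, df])    # question 16: vertical concatenation

# Question 20: merge on a shared column (an inner join by default)
left = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['b', 'c'], 'y': [3, 4]})
merged = left.merge(right, on='key')
print(merged)
```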