# Pandas

- Pandas provides high-level data structures and functions designed to make working with structured or tabular data fast and easy.

- Pandas is a package built on top of NumPy and provides an efficient implementation of a DataFrame.

- The primary pandas objects used in this book are the DataFrame, a tabular, column-oriented data structure with both row and column labels, and the Series, a one-dimensional labeled array object.

- Data frames are essentially multidimensional arrays with attached row and column labels, often holding heterogeneous types and/or missing data.

- Pandas data frames have sophisticated indexing functionality to make it easy to reshape, slice and dice, perform aggregations, and select subsets of data.

- Pandas implements several powerful data operations familiar to users of both database frameworks and spreadsheet programs.

- Pandas is a tool for data processing that helps in data analysis.

- Pandas provides functions and methods to efficiently manipulate large datasets.

In [None]:
import numpy as np
import pandas as pd
import math

# Series

A pandas Series is a one-dimensional array of indexed data. It can be created from a list or array. For example,

In [None]:
series = pd.Series([0.25, 1.5, 3, 7.5, 11, 9, math.pi, 2**0.5, math.e, np.log10(100)])
series

In [None]:
len(series)

As we see in the preceding output, the Series wraps both a sequence of values and a sequence of indices, which we can access with the values and index attributes.

The values are simply a familiar NumPy array.

In [None]:
series.values
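
The text above mentions both the `values` and `index` attributes. As a minimal sketch, using a small illustrative Series `s` (not one of the objects defined in this notebook), the two can be inspected side by side:

```python
import pandas as pd

# Illustrative Series with the default integer index
s = pd.Series([0.25, 0.5, 0.75, 1.0])

# The underlying data as a NumPy array
print(s.values)

# The index object; with no explicit index this is a RangeIndex
print(s.index)
```

When no index is passed, pandas supplies the integers 0 through `len(s) - 1`.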

As with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation.

In [None]:
series[1]

In [None]:
series[1:4]

- A NumPy array has an implicitly defined integer index used to access the values.

- The Pandas Series has an explicitly defined index associated with the values.

In [None]:
data = pd.Series([1.5, 6, 9, 10, 5], index = ['a', 'b', 'c', 'd', 'e'])
data

This explicit index definition gives the Series object additional capabilities. For example, the index need not be an integer, but can consist of values of any desired type.

In [None]:
data['a']
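
The index also does not have to be contiguous or sorted. The sketch below uses an illustrative Series `s` with the integer labels 2, 5, 3, 7 (values chosen only for the example) to show that square brackets look up the label rather than the position:

```python
import pandas as pd

# Explicit integer labels that are neither sequential nor sorted
s = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

# Looks up the label 5, not the sixth element
value = s[5]
print(value)  # 0.5
```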

- A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure that maps typed keys to a set of typed values.

- This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it much more efficient than Python dictionaries for certain operations.
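
One way to see this typing is through the `dtype` attribute. The following is a minimal sketch with a two-entry illustrative Series named `pop`. Because every value shares one type, arithmetic can be applied to the whole Series at once instead of looping over keys as with a dictionary:

```python
import pandas as pd

# Small illustrative Series built from a dictionary
pop = pd.Series({'California': 38332521, 'Texas': 26448193})

# All values share a single integer dtype
print(pop.dtype)

# Vectorized arithmetic over every value, no explicit loop
pop_millions = pop / 1e6
print(pop_millions)
```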

In [None]:
population_dictionary = {'California' : 38332521,
                         'Texas' : 26448193,
                         'New York' : 19651127,
                         'Florida' : 19552860,
                         'Illinois' : 12882135}
population_dictionary

In [None]:
population = pd.Series(population_dictionary)
population

In [None]:
population['New York']

In [None]:
population['California':'Florida']
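
Note that slicing by label includes the end label, unlike ordinary Python slicing by position. The sketch below illustrates the difference on a small Series. It uses the `iloc` indexer, which this notebook has not introduced, for the positional slice:

```python
import pandas as pd

data = pd.Series([1.5, 6, 9, 10, 5], index=['a', 'b', 'c', 'd', 'e'])

# Label-based slice: the end label 'c' is included
by_label = data['a':'c']
print(by_label)

# Position-based slice: the end position 2 is excluded
by_position = data.iloc[0:2]
print(by_position)
```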

**Question 1.** Create a Series of the first 100 integers.

In [None]:
pd.Series(np.arange(1,101))

**Question 2.** Create a series from a dictionary using the following data. 

|City Name|Airport Abbreviation|
|---------|--------------------|
|Dubai    |DXB|
|Abu Dhabi|AUH|
|Sharjah|SHJ|

Make the city the key and the airport abbreviation the value.


**Question 3.** Create a series using the following data.

|State Name|Capital|
|---------|--------------------|
|California|Sacramento|
|Texas|Austin|
|North Carolina|Raleigh|

Make the state name the index.

# Data Frame

The next fundamental structure in Pandas is the DataFrame. Like the Series object discussed in the previous section, the DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary.

- If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names.

- Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a sequence of aligned Series objects. 
  
  **Note:** Here, by “aligned” we mean that they share the same index.

In [None]:
area_dictionary = {'California' : 423967,
                   'Texas' : 695662,
                   'New York' : 141297,
                   'Florida' : 170312,
                   'Illinois' : 149995}
area_dictionary

In [None]:
area = pd.Series(area_dictionary)
area

Now that we have this along with the population Series from before, we can use a dictionary to construct a single two-dimensional object containing this information.

In [None]:
states = pd.DataFrame({'population' : population,
                       'area': area})

states
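
When a DataFrame is built from several Series, pandas matches values by index label rather than by position. As a small sketch, the two illustrative Series below (`pop_small` and `area_small`, hypothetical names holding a subset of the data above) list their states in a different order, yet each value still lands in the correct row:

```python
import pandas as pd

# Same labels, listed in a different order in each Series
pop_small = pd.Series({'Texas': 26448193, 'New York': 19651127})
area_small = pd.Series({'New York': 141297, 'Texas': 695662})

# Rows are aligned on the shared index labels
aligned = pd.DataFrame({'population': pop_small, 'area': area_small})
print(aligned)
```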

In [None]:
states.index

In [None]:
states.columns

- Similarly, we can also think of a DataFrame as a specialization of a dictionary. 

- Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data.

- For example, indexing with the 'area' key returns the Series object containing the areas we saw earlier:

In [None]:
states['area']

A Pandas Data Frame can be constructed in a variety of ways.

In [None]:
pd.DataFrame(population, columns = ['population'])

In [None]:
data = [{'a' : i, 'b' : 2*i} for i in np.arange(3)]
data

In [None]:
pd.DataFrame(data)

In [None]:
pd.DataFrame([{'a' : 1, 'b': 2}, {'b' : 3, 'c' : 4}])
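
In the last example the two dictionaries do not share all their keys, and pandas fills the missing entries with `NaN`. A brief sketch of how to confirm this, using the illustrative name `df`:

```python
import pandas as pd

# The first record has no 'c'; the second has no 'a'
df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# True marks the cells that were filled with NaN
print(df.isna())

# Columns containing NaN become floating point
print(df.dtypes)
```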

- Given a two-dimensional array of data, we can create a Data Frame with any specified column and index names.

- If omitted, an integer index will be used for each:

In [None]:
pd.DataFrame(np.random.rand(3,2),
             columns = ['foo', 'bar'])

In [None]:
pd.DataFrame(np.random.rand(3,2),
             columns = ['foo', 'bar'],
             index = ['a', 'b', 'c'])

**Question 4.** Create a data frame using the following data.

|name|age|height|
|----|---|------|
|jean|15|150|
|jake|17|161|
|joe|13|135|

**Question 5.** Create the same data frame as you did in **Question 4** but make the name the index and add a column named `hair_color`.

||age|height|hair_color|
|----|---|------|------|
|**jean**|15|150|brown|
|**jake**|17|161|black|
|**joe**|13|135|brown|

# Manipulating Data Frames

In [None]:
url = 'https://raw.githubusercontent.com/mahmoudharding/exceed/main/data/2022_population.csv'
world = pd.read_csv(url)
world