# Data Analysis using Python Libraries


## Creating a NumPy array
One of the most common ways to create a NumPy array is to pass a list to the np.array() function.

In [1]:
import numpy as np
list1 = [0, 1, 2, 3, 4]
arr = np.array(list1)

print(type(arr))
print(arr)

<class 'numpy.ndarray'>
[0 1 2 3 4]


## Difference between lists and arrays
The key difference between an array and a list is that arrays are designed to handle vectorised operations, while Python lists are not. That means an operation applied to an array is performed on every item in the array, rather than on the array object as a whole.

In [2]:
list1 = [0, 1, 2, 3, 4]
arr = np.array(list1)
print(arr)
arr=arr+2
print(arr)

[0 1 2 3 4]
[2 3 4 5 6]
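
To make the contrast with lists concrete, here is a minimal sketch of the same addition attempted on a plain list. The variable names (values, list_plus_two, list_error, arr_plus_two) are illustrative and not part of the original notebook.

```python
import numpy as np

values = [0, 1, 2, 3, 4]

# A plain list needs an explicit loop (here, a list comprehension).
list_plus_two = [v + 2 for v in values]

# Adding an integer directly to a list is not defined and raises TypeError.
try:
    values + 2
    list_error = None
except TypeError as exc:
    list_error = type(exc).__name__

# The array applies the addition to every element in one expression.
arr_plus_two = np.array(values) + 2

print(list_plus_two)  # [2, 3, 4, 5, 6]
print(list_error)     # TypeError
print(arr_plus_two)   # [2 3 4 5 6]
```

The list version works, but only because the loop is written out by hand; the array version expresses the same operation directly.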


## Create a 2d array from a list of lists
You can pass a list of lists to create a matrix-like 2d array.

In [3]:
list2 = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
arr2 = np.array(list2)
print(arr2)

[[0 1 2]
 [3 4 5]
 [6 7 8]]


## The dtype argument
You can specify the data type by setting the dtype argument. Some of the most commonly used NumPy dtypes are: float, int, bool, str, and object.

In [4]:
list2 = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
arr3 = np.array(list2, dtype='float')
print(arr3)

[[0. 1. 2.]
 [3. 4. 5.]
 [6. 7. 8.]]


## The astype method
You can also convert an array to a different data type using the astype() method. Remember that, unlike lists, all items in an array have to be of the same type.

In [5]:
list2 = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
arr3 = np.array(list2, dtype='float')
print(arr3)
arr3_s = arr3.astype('int').astype('str')
print(arr3_s)

[[0. 1. 2.]
 [3. 4. 5.]
 [6. 7. 8.]]
[['0' '1' '2']
 ['3' '4' '5']
 ['6' '7' '8']]
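
One detail worth seeing in isolation is what astype('int') does to values with a fractional part. The sketch below uses made-up values (floats) purely for illustration: the conversion drops the fractional part instead of rounding, and it returns a new array instead of changing the original.

```python
import numpy as np

floats = np.array([1.7, -1.7, 2.2])

# astype returns a new array; the fractional part is dropped (truncation toward zero).
truncated = floats.astype('int')

# Round first if the nearest integer is wanted instead.
rounded = np.round(floats).astype('int')

print(truncated)  # [ 1 -1  2]
print(rounded)    # [ 2 -2  2]
print(floats)     # the original array is unchanged
```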


## dtype='object'
However, if you are uncertain about what data type your array will hold, or if you want to hold characters and numbers in the same array, you can set the dtype as 'object'.

In [6]:
arr_obj = np.array([1, 'a'], dtype='object')
print(arr_obj)

[1 'a']
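
For comparison, here is a small sketch of what happens when the same mixed list is passed without dtype='object'. The names mixed_default and mixed_object are illustrative. NumPy then picks a single common type, which for a number and a character is a string type, so the number is stored as text.

```python
import numpy as np

mixed_default = np.array([1, 'a'])                 # NumPy chooses one common type
mixed_object = np.array([1, 'a'], dtype='object')  # original Python objects are kept

print(mixed_default.dtype.kind)   # U, meaning a Unicode string dtype
print(repr(mixed_default[0]))     # the number has become a string
print(type(mixed_object[0]))      # <class 'int'>
print(mixed_object[0] + 1)        # 2, arithmetic still works on the stored int
```

The trade-off is that object arrays lose most of the speed benefits of fixed numeric dtypes, so they are best kept for cases where mixed types are genuinely needed.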


## The tolist() method
You can always convert an array into a list using the tolist() method.

In [7]:
arr_list = arr3.tolist()
print(arr_list)

[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]


## Inspecting a NumPy array
NumPy arrays have a range of built-in attributes that allow you to inspect different aspects of an array, such as its shape, data type, size, and number of dimensions.

In [8]:
list2 = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
arr3 = np.array(list2, dtype='float')
print('Shape:', arr3.shape)
print('Data type:', arr3.dtype) 
print('size:', arr3.size)
print('Number of dimensions:', arr3.ndim)

Shape: (3, 3)
Data type: float64
size: 9
Number of dimensions: 2


## Extracting specific items from an array
You can extract portions of the array using indices, much like when you’re working with lists. Unlike lists, however, arrays accept up to as many comma-separated indices in the square brackets as there are dimensions.

In [9]:
list2 = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
arr3 = np.array(list2, dtype='float')
print("whole:", arr3)
print("part:", arr3[:2, :2]) # first two rows and first two columns

whole: [[0. 1. 2.]
 [3. 4. 5.]
 [6. 7. 8.]]
part: [[0. 1.]
 [3. 4.]]
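
Beyond slicing a block, the same bracket syntax can pick out a single element, a whole row, or a whole column. A minimal sketch, using an illustrative array named grid with the same values as above:

```python
import numpy as np

grid = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]], dtype='float')

element = grid[1, 2]       # row 1, column 2 (both counted from zero)
second_row = grid[1]       # a single index selects a whole row
last_column = grid[:, -1]  # all rows, last column

print(element)      # 5.0
print(second_row)   # [3. 4. 5.]
print(last_column)  # [2. 5. 8.]
```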


## Boolean indexing
A Boolean index array has the same shape as the array to be filtered, but it contains only True and False values.

In [10]:
list2 = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
arr3 = np.array(list2, dtype='float')
boo = arr3 > 2 # value greater than 2
print(boo) 

[[False False False]
 [ True  True  True]
 [ True  True  True]]
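
The Boolean array becomes useful once it is passed back into the square brackets. The following sketch (with illustrative names grid, mask, selected and count) shows the mask being used to filter and to count elements.

```python
import numpy as np

grid = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]], dtype='float')
mask = grid > 2

# Indexing with the mask returns a 1d array of the elements where mask is True.
selected = grid[mask]

# True counts as 1, so summing the mask counts the matching elements.
count = mask.sum()

print(selected)  # [3. 4. 5. 6. 7. 8.]
print(count)     # 6
```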


## Indices in a pandas series
A pandas series is similar to a list, but differs in that a series associates a label with each element, which makes it resemble a dictionary. If an index is not explicitly provided by the user, pandas creates a RangeIndex ranging from 0 to N-1. Each series object also has a data type.

In [11]:
import pandas as pd
new_series = pd.Series([5, 6, 7, 8, 9, 10])
print(new_series)

0     5
1     6
2     7
3     8
4     9
5    10
dtype: int64
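
The resemblance to a dictionary can be illustrated by building a series directly from one. This is a small sketch; the name populations and the country-code labels are chosen for illustration only.

```python
import pandas as pd

# Dictionary keys become the index labels, dictionary values become the data.
populations = pd.Series({'KZ': 18.8, 'RU': 144.5, 'BY': 9.5})

print(populations['RU'])        # 144.5, lookup by label as with a dictionary
print('BY' in populations)      # True, membership is tested against the labels
print(list(populations.index))  # ['KZ', 'RU', 'BY']
```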


A series exposes all of its values through the values attribute, and individual elements can be accessed by index.

In [12]:
new_series = pd.Series([5, 6, 7, 8, 9, 10])
print(new_series.values)
print('--------------------------------')
print(new_series[4]) # access by index

[ 5  6  7  8  9 10]
--------------------------------
9


You can also provide an index manually.

In [13]:
new_series = pd.Series([5, 6, 7, 8, 9, 10], index=['a', 'b', 'c', 'd', 'e', 'f'])
print(new_series.values)
print('--------------------------------')
print(new_series['f']) # access by label

[ 5  6  7  8  9 10]
--------------------------------
10


It is easy to retrieve several elements of a series by their indices or make group assignments.

In [14]:
new_series = pd.Series([5, 6, 7, 8, 9, 10], index=['a', 'b', 'c', 'd', 'e', 'f'])
print(new_series)
print('--------------------------------')
new_series[['a', 'b', 'f']] = 0 # assign multiple values
print(new_series)

a     5
b     6
c     7
d     8
e     9
f    10
dtype: int64
--------------------------------
a    0
b    0
c    7
d    8
e    9
f    0
dtype: int64


## Filtering and maths operations
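Filtering and maths operations are easy with Pandas as well: a Boolean condition in square brackets selects the matching elements, and arithmetic operators apply to every selected element.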

In [15]:
new_series = pd.Series([5, 6, 7, 8, 9, 10], index=['a', 'b', 'c', 'd', 'e', 'f'])
new_series2 = new_series[new_series > 5] # filter values greater than 5
print(new_series2)
print('--------------------------------')
new_series2[new_series2 > 5]*=2 # double the values greater than 5
print(new_series2)

b     6
c     7
d     8
e     9
f    10
dtype: int64
--------------------------------
b    12
c    14
d    16
e    18
f    20
dtype: int64


## Creating a Pandas data frame
Pandas data frames can be constructed using Python dictionaries.

In [16]:
df = pd.DataFrame({
    'country': ['Kazakhstan', 'Russia', 'Belarus', 'Ukraine'],
    'population': [18.8, 144.5, 9.5, 43.7],
    'square': [2724902, 17098242, 207600, 603628]
})
print(df)

      country  population    square
0  Kazakhstan        18.8   2724902
1      Russia       144.5  17098242
2     Belarus         9.5    207600
3     Ukraine        43.7    603628


You can also create a data frame from a list of lists, where each inner list becomes a row.

In [17]:
list2 = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]  
df = pd.DataFrame(list2)
print(df)
df.columns = ['V1', 'V2', 'V3'] # assign column names
print(df)

   0  1  2
0  0  1  2
1  3  4  5
2  6  7  8
   V1  V2  V3
0   0   1   2
1   3   4   5
2   6   7   8
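
As an alternative to assigning df.columns afterwards, the column names can be given when the data frame is created. A minimal sketch, using the columns parameter of pd.DataFrame and illustrative names rows and df_named:

```python
import pandas as pd

rows = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]

# Each inner list is a row; columns= names the columns in one step.
df_named = pd.DataFrame(rows, columns=['V1', 'V2', 'V3'])

print(df_named)
```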


You can ascertain the type of a column with the type() function.

In [18]:
df = pd.DataFrame({
    'country': ['Kazakhstan', 'Russia', 'Belarus', 'Ukraine'],
    'population': [18.8, 144.5, 9.5, 43.7],
    'square': [2724902, 17098242, 207600, 603628]
})
print(type(df['country'])) # a column is a Pandas Series object

<class 'pandas.core.series.Series'>


A Pandas data frame object has two indices: a column index and a row index. Again, if you do not provide a row index, Pandas will create a RangeIndex from 0 to N-1.

In [19]:
df = pd.DataFrame({
    'country': ['Kazakhstan', 'Russia', 'Belarus', 'Ukraine'],
    'population': [18.8, 144.5, 9.5, 43.7],
    'square': [2724902, 17098242, 207600, 603628]
})
print(df.columns) # print column names
print('----------------')
print(df.index) # print row indices

Index(['country', 'population', 'square'], dtype='object')
----------------
RangeIndex(start=0, stop=4, step=1)


There are numerous ways to provide row indices explicitly. For example, you could provide an index when creating a data frame.

In [20]:
df = pd.DataFrame({
    'country': ['Kazakhstan', 'Russia', 'Belarus', 'Ukraine'],
    'population': [18.8, 144.5, 9.5, 43.7],
    'square': [2724902, 17098242, 207600, 603628]
}, index=['KZ', 'RU', 'BY', 'UA']) # provide custom row indices
print(df)

       country  population    square
KZ  Kazakhstan        18.8   2724902
RU      Russia       144.5  17098242
BY     Belarus         9.5    207600
UA     Ukraine        43.7    603628


Alternatively, you can assign the row indices after the data frame has been created. Here, I also named the index 'country code'.

In [21]:
df = pd.DataFrame({
    'country': ['Kazakhstan', 'Russia', 'Belarus', 'Ukraine'],
    'population': [18.8, 144.5, 9.5, 43.7],
    'square': [2724902, 17098242, 207600, 603628]
})
print(df)
print('----------------')
df.index = ['KZ', 'RU', 'BY', 'UA'] # assign custom row indices
df.index.name = 'country code' # name the index
print(df)

      country  population    square
0  Kazakhstan        18.8   2724902
1      Russia       144.5  17098242
2     Belarus         9.5    207600
3     Ukraine        43.7    603628
----------------
                 country  population    square
country code                                  
KZ            Kazakhstan        18.8   2724902
RU                Russia       144.5  17098242
BY               Belarus         9.5    207600
UA               Ukraine        43.7    603628


Rows can be accessed through the index in several ways. First, you could use .loc with an index label in square brackets.

In [22]:
print(df.loc['KZ']) # access row by index label

country       Kazakhstan
population          18.8
square           2724902
Name: KZ, dtype: object


Second, you could use .iloc with an integer position in square brackets.

In [23]:
print(df.iloc[0]) # access row by integer position

country       Kazakhstan
population          18.8
square           2724902
Name: KZ, dtype: object


Particular rows and columns can be selected together in the same way.

In [24]:
print(df.loc[['KZ', 'RU'], 'population']) # access multiple rows and a specific column

country code
KZ     18.8
RU    144.5
Name: population, dtype: float64


You can pass .loc two arguments, a row selection and a column selection, and both accept slices as well as lists. Note that a label-based slice includes its end label.

In [25]:
print(df.loc['KZ' : 'BY', :]) # slicing rows and all columns

                 country  population    square
country code                                  
KZ            Kazakhstan        18.8   2724902
RU                Russia       144.5  17098242
BY               Belarus         9.5    207600
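
The difference between label-based and position-based slicing is easy to trip over, so here is a short side-by-side sketch. It rebuilds the same data frame so that it runs on its own; by_label and by_position are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['Kazakhstan', 'Russia', 'Belarus', 'Ukraine'],
    'population': [18.8, 144.5, 9.5, 43.7],
    'square': [2724902, 17098242, 207600, 603628]
}, index=['KZ', 'RU', 'BY', 'UA'])

by_label = df.loc['KZ':'BY']   # label slice: the end label 'BY' is included
by_position = df.iloc[0:2]     # position slice: the end position 2 is excluded

print(list(by_label.index))     # ['KZ', 'RU', 'BY']
print(list(by_position.index))  # ['KZ', 'RU']
```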


## Filtering
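Filtering is performed using so-called Boolean arrays: a condition on a column yields one True or False value per row, and only the rows marked True are kept.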

In [26]:
print(df[df.population > 20][['country', 'square']]) # filter rows where population is greater than 20

              country    square
country code                   
RU             Russia  17098242
UA            Ukraine    603628
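
Several conditions can be combined into one Boolean array. The sketch below is one way to do it, assuming the same data as above; the second threshold (an area above 1,000,000) is made up for illustration. Each condition needs its own parentheses, and the element-wise operators & (and) and | (or) are used in place of the Python keywords.

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['Kazakhstan', 'Russia', 'Belarus', 'Ukraine'],
    'population': [18.8, 144.5, 9.5, 43.7],
    'square': [2724902, 17098242, 207600, 603628]
}, index=['KZ', 'RU', 'BY', 'UA'])

# Combine two row conditions element-wise.
mask = (df['population'] > 20) & (df['square'] > 1_000_000)

# .loc takes the row mask and the column list in one step.
result = df.loc[mask, ['country', 'square']]

print(result)
```

Only Russia satisfies both conditions here, so the result has a single row.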


## Deleting columns
You can delete a column using the drop() method, which returns a new data frame and leaves the original unchanged.

In [27]:
print(df)
print('----------------')
df_dropped = df.drop('population', axis='columns') # delete the 'population' column
print(df_dropped)

                 country  population    square
country code                                  
KZ            Kazakhstan        18.8   2724902
RU                Russia       144.5  17098242
BY               Belarus         9.5    207600
UA               Ukraine        43.7    603628
----------------
                 country    square
country code                      
KZ            Kazakhstan   2724902
RU                Russia  17098242
BY               Belarus    207600
UA               Ukraine    603628
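
The sketch below illustrates two related points under the same data: the original data frame still has its column after drop(), and the columns= and index= keywords of drop() can be used in place of the axis argument. The names df_no_pop and df_no_by are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['Kazakhstan', 'Russia', 'Belarus', 'Ukraine'],
    'population': [18.8, 144.5, 9.5, 43.7],
    'square': [2724902, 17098242, 207600, 603628]
}, index=['KZ', 'RU', 'BY', 'UA'])

df_no_pop = df.drop(columns='population')  # same effect as axis='columns'
df_no_by = df.drop(index='BY')             # drop a row by its label

print('population' in df.columns)  # True, df itself is not modified
print(list(df_no_pop.columns))     # ['country', 'square']
print(list(df_no_by.index))        # ['KZ', 'RU', 'UA']
```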


## Reading from and writing to a file
Pandas supports many popular file formats including CSV, XML, HTML, Excel, SQL, JSON, etc. Of these, CSV is the format that you will work with the most. You can read in the data from a CSV file using the read_csv() function:

df = pd.read_csv('filename.csv', sep=',')

Similarly, you can write a data frame to a CSV file with the to_csv() method.

df.to_csv('filename.csv')
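
One thing to be aware of is that to_csv() also writes the row index by default. The following sketch round-trips a data frame through an in-memory text buffer (io.StringIO is used here as a stand-in for a real file, purely so the example runs anywhere) and shows the effect of index=False.

```python
import io
import pandas as pd

df = pd.DataFrame({
    'country': ['Kazakhstan', 'Russia', 'Belarus', 'Ukraine'],
    'population': [18.8, 144.5, 9.5, 43.7],
    'square': [2724902, 17098242, 207600, 603628]
})

# Default behaviour: the row index is written as an extra, unnamed column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False leaves the row index out of the file.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
df_back = pd.read_csv(without_index)

print(cols_with_index)           # ['Unnamed: 0', 'country', 'population', 'square']
print(df_back.columns.tolist())  # ['country', 'population', 'square']
```

If the index carries meaningful labels, an alternative is to keep it in the file and restore it on reading with read_csv(..., index_col=0).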