### <center style = 'color : red '>  PANDAS : Python Data Analysis Library </center>

The Pandas library is built on NumPy and provides easy-to-use data structures and data analysis tools for the Python programming language.

https://pandas.pydata.org/

In [None]:
import numpy as np
import pandas as pd

* The **pandas Series** object can be seen as an enhanced NumPy 1D array, and the **pandas DataFrame** can be seen as an enhanced NumPy 2D array.

* The main difference is that pandas Series and DataFrames have an explicit index, while NumPy arrays have only implicit (positional) indexing. So, in any Python code where you would write something like

```python
import numpy as np
a = np.array([1,2,3])
```
you can just use
```python
import pandas as pd
a = pd.Series([1,2,3])
```

Most NumPy functions and array methods also work with a pandas Series.
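
As a minimal sketch of the statement above, a NumPy universal function can be applied directly to a Series, and the result keeps the Series index. The variable names and the labels `'x'`, `'y'`, `'z'` are chosen only for illustration.

```python
import numpy as np
import pandas as pd

arr = np.array([1, 4, 9])
ser = pd.Series([1, 4, 9], index=['x', 'y', 'z'])

# For an array input, the ufunc returns a plain NumPy array.
root_arr = np.sqrt(arr)

# For a Series input, it returns a Series carrying the same labels.
root_ser = np.sqrt(ser)

print(root_arr)
print(root_ser)
```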


Is there any performance difference between a NumPy array and a pandas Series?

Yes. Accessing individual elements is much slower in pandas than in NumPy, because pandas does a lot of extra work when you index into a Series (label lookup, for example), and part of that work runs in Python. Vectorized operations on a whole Series are generally much less affected.
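
One rough way to check this on your own machine is to time an element-by-element loop over both containers. The sketch below uses the standard `timeit` module; `loop_sum` is a hypothetical helper written for this example, and the timings it prints depend on hardware and library versions, so no particular numbers are claimed here.

```python
import timeit
import numpy as np
import pandas as pd

arr = np.arange(10_000)
ser = pd.Series(arr)

def loop_sum(container):
    """Sum the elements one by one, indexing into the container each time."""
    total = 0
    for i in range(len(container)):
        total += container[i]
    return total

# Time the same loop on the NumPy array and on the pandas Series.
t_arr = timeit.timeit(lambda: loop_sum(arr), number=3)
t_ser = timeit.timeit(lambda: loop_sum(ser), number=3)
print(f"element-wise loop  numpy: {t_arr:.4f}s  pandas: {t_ser:.4f}s")

# The vectorized sum avoids per-element indexing in both libraries.
print(arr.sum(), ser.sum())
```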

## <center style = "color:blue"> series </center>

A one-dimensional labeled array capable of holding any data type.

* If no index is given, pandas creates a default integer index.

In [None]:
s = pd.Series(
    [3,-2,1,4]
)

In [None]:
s

In [None]:
type(s)

In [None]:
# Get one element
s[0]

### Index of strings: using a non-numeric index

In [None]:
s = pd.Series(
    [3,-2,1,4], 
    index=['a', 'b', 'c', 'd']
)

In [None]:
s['a'] 

### Index of dates

In [None]:
dates = pd.date_range('20190301', periods=6)
dates

In [None]:
s = pd.Series(np.random.randn(6), 
                  index=dates)
s
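
With a date index, rows can be selected by date strings. The following sketch assumes the same `date_range` as above but uses `np.arange` instead of random values so that the result is predictable; the name `window` is only illustrative.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20190301', periods=6)
s = pd.Series(np.arange(6), index=dates)

# Label-based slicing on a DatetimeIndex includes both end points.
window = s.loc['2019-03-02':'2019-03-04']
print(window)
```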


### Boolean Indexing

In [None]:
s = pd.Series([3,-2,1,4])
s

In [None]:
s[3]

In [None]:
# Series s where value is >1
s[s > 1] 

In [None]:
# Series s where value is not >1
s[~(s > 1)] 

In [None]:
# s where value is < -1 or >= 3
s[(s < -1) | (s >= 3)] 

In [None]:
# Setting
s[2] = -6 # Set the element with index label 2 to -6
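
With the default integer index, labels and positions coincide, so `s[2]` is unambiguous. When the integer labels differ from the positions, `.loc` (by label) and `.iloc` (by position) make the intent explicit. A small sketch, with labels `10, 20, 30, 40` chosen only for illustration:

```python
import pandas as pd

s = pd.Series([3, -2, 1, 4], index=[10, 20, 30, 40])

by_label = s.loc[20]      # element whose index label is 20
by_position = s.iloc[0]   # first element, whatever its label

# Setting by label changes the element labeled 30 (the third one).
s.loc[30] = -6
print(by_label, by_position)
print(s)
```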

### <center style = 'color : blue '> Data Alignment </center>

In [29]:
s = pd.Series(
    [3,-2,1,4,1], 
    index=['a', 'b', 'c', 'd','e']
)

s

a    3
b   -2
c    1
d    4
e    1
dtype: int64

In [30]:
s2 = pd.Series([-2, 7, 3, -5, 1], index=['a', 'c', 'd','e','f'])
s2

a   -2
c    7
d    3
e   -5
f    1
dtype: int64

### add

In [31]:
s.add(s2)

a    1.0
b    NaN
c    8.0
d    7.0
e   -4.0
f    NaN
dtype: float64

In [32]:
s.add(s2, fill_value=0)

a    1.0
b   -2.0
c    8.0
d    7.0
e   -4.0
f    1.0
dtype: float64
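
The two results above follow from index alignment: the operation is performed label by label over the union of both indexes. The sketch below, which reuses the same two Series, checks that the `+` operator behaves like `add` without `fill_value`, and that `fill_value` replaces the missing side when a label exists in only one Series.

```python
import pandas as pd

s = pd.Series([3, -2, 1, 4, 1], index=['a', 'b', 'c', 'd', 'e'])
s2 = pd.Series([-2, 7, 3, -5, 1], index=['a', 'c', 'd', 'e', 'f'])

# '+' aligns on labels; labels present on one side only give NaN.
plain = s + s2

# fill_value=0 stands in for the missing side before adding.
filled = s.add(s2, fill_value=0)

print(plain.index.tolist())   # union of both indexes
print(plain.isna().sum())     # number of labels missing on one side
print(filled)
```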

### sub

In [33]:
s.sub(s2, fill_value=0)

a    5.0
b   -2.0
c   -6.0
d    1.0
e    6.0
f   -1.0
dtype: float64

In [36]:
s.sub(s2,fill_value=2)

a    5.0
b   -4.0
c   -6.0
d    1.0
e    6.0
f    1.0
dtype: float64

### div

In [38]:
s.div(s2, fill_value=4)

a   -1.500000
b   -0.500000
c    0.142857
d    1.333333
e   -0.200000
f    4.000000
dtype: float64

### mul 

In [39]:
s.mul(s2, fill_value=0)

a    -6.0
b    -0.0
c     7.0
d    12.0
e    -5.0
f     0.0
dtype: float64

### Drop values

In [40]:
s

a    3
b   -2
c    1
d    4
e    1
dtype: int64

In [41]:
# Drop values from rows (axis=0)
s.drop(['a', 'c']) 

b   -2
d    4
e    1
dtype: int64

## <center style = "color:blue"> DataFrame </center>

A two-dimensional labeled data structure with columns of potentially different types.

### Create a DataFrame from a dictionary

In [42]:
data = {
    'Country': ['Morocco', 'China', 'France','Indonesia','Spain'],
    'Capital': ['Rabat', 'Beijing', 'Paris','Jakarta','Madrid'],
    'Population': [30000000, 1000000000, 80000000,130000000,75000000]
}

df = pd.DataFrame(
    data,
    columns=['Country', 'Capital', 'Population']
)

In [43]:
df

     Country  Capital  Population
0    Morocco    Rabat    30000000
1      China  Beijing  1000000000
2     France    Paris    80000000
3  Indonesia  Jakarta   130000000
4      Spain   Madrid    75000000


In [45]:
df.head(2) # Get the first rows

   Country  Capital  Population
0  Morocco    Rabat    30000000
1    China  Beijing  1000000000


In [46]:
df.tail(3) # Get the last rows

     Country  Capital  Population
2     France    Paris    80000000
3  Indonesia  Jakarta   130000000
4      Spain   Madrid    75000000


In [47]:
df.info() # Summary information about the DataFrame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
Country       5 non-null object
Capital       5 non-null object
Population    5 non-null int64
dtypes: int64(1), object(2)
memory usage: 200.0+ bytes


In [48]:
print(df.index) # Get the index
print(df.columns) # List of columns

RangeIndex(start=0, stop=5, step=1)
Index(['Country', 'Capital', 'Population'], dtype='object')


In [49]:
print('shape : ', df.shape) # (rows, columns): shape of the DataFrame

shape :  (5, 3)


In [50]:
df.describe() # Summary statistics

         Population
count  5.000000e+00
mean   2.630000e+08
std    4.135154e+08
min    3.000000e+07
25%    7.500000e+07
50%    8.000000e+07
75%    1.300000e+08
max    1.000000e+09


### DataFrame selection & filtering

In [51]:
df['Country']

0      Morocco
1        China
2       France
3    Indonesia
4        Spain
Name: Country, dtype: object

In [52]:
df[['Country','Population']] # Select only two columns

     Country  Population
0    Morocco    30000000
1      China  1000000000
2     France    80000000
3  Indonesia   130000000
4      Spain    75000000


In [53]:
# Boolean filtering with a logical expression
df[df['Population']>60000000] # Keep only the rows where the condition is True

     Country  Capital  Population
1      China  Beijing  1000000000
2     France    Paris    80000000
3  Indonesia  Jakarta   130000000
4      Spain   Madrid    75000000


In [54]:
# Getting a subset
df[:] # Full slice: returns all rows of the DataFrame

     Country  Capital  Population
0    Morocco    Rabat    30000000
1      China  Beijing  1000000000
2     France    Paris    80000000
3  Indonesia  Jakarta   130000000
4      Spain   Madrid    75000000


In [56]:
# Getting a subset
df[1:3] # Row slice: rows at positions 1 and 2

   Country  Capital  Population
1    China  Beijing  1000000000
2   France    Paris    80000000


-----

In [57]:
df[1:3,2] # Caution: raises an error, plain [] does not accept a (rows, columns) pair

TypeError: unhashable type: 'slice'

In [58]:
df[1:,['Capital']] # Caution: raises an error for the same reason

TypeError: unhashable type: 'slice'
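
The two failing expressions can be written with the `.iloc` (position-based) and `.loc` (label-based) indexers, which do accept a rows/columns pair. The sketch below rebuilds the same DataFrame so that it runs on its own; the result names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['Morocco', 'China', 'France', 'Indonesia', 'Spain'],
    'Capital': ['Rabat', 'Beijing', 'Paris', 'Jakarta', 'Madrid'],
    'Population': [30000000, 1000000000, 80000000, 130000000, 75000000],
})

# Rows at positions 1 and 2, column at position 2 ('Population').
pop_rows_1_2 = df.iloc[1:3, 2]

# Rows from label 1 onwards, only the 'Capital' column (kept as a DataFrame).
capitals_from_1 = df.loc[1:, ['Capital']]

print(pop_rows_1_2)
print(capitals_from_1)
```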

### loc (locate): selection by label

In [59]:
df.loc[0]

Country        Morocco
Capital          Rabat
Population    30000000
Name: 0, dtype: object

In [60]:
# By label
df.loc[[0], ['Country']] # Lists of row and column labels return a DataFrame (here 1 row x 1 column)

   Country
0  Morocco


In [65]:
# By label
df.loc[:, ['Population']] # Select all rows of the 'Population' column

   Population
0    30000000
1  1000000000
2    80000000
3   130000000
4    75000000


### iloc (integer location): selection by position

In [66]:
# By Position
df.iloc[0,0] # Select single value by row & column

'Morocco'

In [67]:
# By position
df.iloc[1:4,[0,1]] # Select rows at positions 1 to 3 and the first two columns

     Country  Capital
1      China  Beijing
2     France    Paris
3  Indonesia  Jakarta


### Dropping columns/rows

In [68]:
# Dropping
df.drop('Capital', axis=1) # Drop a column (axis=1); returns a new DataFrame, df itself is unchanged

     Country  Population
0    Morocco    30000000
1      China  1000000000
2     France    80000000
3  Indonesia   130000000
4      Spain    75000000


In [71]:
df.drop('Capital', axis=1, inplace=True) # inplace=True modifies df itself and returns None

In [72]:
df

     Country  Population
0    Morocco    30000000
1      China  1000000000
2     France    80000000
3  Indonesia   130000000
4      Spain    75000000
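
An alternative to `inplace=True` is to keep the original DataFrame and bind the result of `drop` to a new name. The sketch below uses the `columns=` and `index=` keywords of `drop`, which are equivalent to passing labels with `axis=1` and `axis=0`; the two-row DataFrame and the result names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['Morocco', 'China'],
    'Capital': ['Rabat', 'Beijing'],
    'Population': [30000000, 1000000000],
})

# drop returns a new DataFrame; df keeps all three columns.
without_capital = df.drop(columns='Capital')

# Rows are dropped by index label.
without_first_row = df.drop(index=0)

print(df.columns.tolist())
print(without_capital.columns.tolist())
print(without_first_row)
```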


### Sort & Rank

In [None]:
df.sort_index() #Sort by labels along an axis

In [None]:
# Sort by the values of a column
df.sort_values(by='Population')

In [None]:
# Sort by the values of another column
df.sort_values(by='Country')

In [None]:
# Assign ranks to entries
df.rank() 

In [None]:
df['Population_rank'] = df['Population'].rank()
df
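
By default `rank` assigns rank 1 to the smallest value. To rank from the largest population down, `ascending=False` can be passed. A small sketch on the same data; the column name `Population_rank_desc` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['Morocco', 'China', 'France', 'Indonesia', 'Spain'],
    'Population': [30000000, 1000000000, 80000000, 130000000, 75000000],
})

# Rank 1 goes to the largest population.
df['Population_rank_desc'] = df['Population'].rank(ascending=False).astype(int)

# Sorting in descending order gives the same ordering.
by_size = df.sort_values(by='Population', ascending=False)
print(by_size)
```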

### Summary

### sum

In [None]:
df.sum() # Sum of values

### cumsum: cumulative sum

In [None]:
df.cumsum() # Cumulative sum of values

###  min and max

In [None]:
print(df["Population"].min()) # Minimum value
df["Population"].max() # Maximum value


### What percentage of the total population in our dataset does China represent?

In [None]:
100 * df[df['Country']=='China']['Population'] / df['Population'].sum() # Percentage of the total
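
The expression above returns a one-element Series. If a single number is wanted, one possible approach (a sketch, with illustrative variable names) is to select the value with `.loc` and a Boolean mask, then take the first element.

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['Morocco', 'China', 'France', 'Indonesia', 'Spain'],
    'Population': [30000000, 1000000000, 80000000, 130000000, 75000000],
})

# Boolean mask on the rows, column label on the columns, then the first match.
china_pop = df.loc[df['Country'] == 'China', 'Population'].iloc[0]

# Share of the total population, as a fraction between 0 and 1.
share = china_pop / df['Population'].sum()
print(f"{share:.1%}")
```

With the figures used in this notebook, the share comes out at about 76%.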

### idxmin and idxmax: index labels of the smallest and largest values

In [None]:
df['Population'].idxmin()

In [None]:
df['Population'].idxmax() # Index label of the maximum value


### mean & median

In [None]:
df.mean(numeric_only=True) # Mean of the numeric columns


In [None]:
df.median(numeric_only=True) # Median of the numeric columns

### <center style='color:blue'> Applying Functions </center>


In [None]:
f = lambda x: x*2

In [None]:
df.apply(f) # Apply function to each column

In [None]:
df.applymap(f) # Apply function element-wise (renamed DataFrame.map in pandas 2.1+)
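
With `f = lambda x: x*2` the two calls give the same result, which hides the difference between them. The sketch below uses a function that only makes sense on a whole column to show that `DataFrame.apply` passes each column as a Series, while `Series.map` works value by value. The small numeric DataFrame `nums` is hypothetical.

```python
import pandas as pd

nums = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 40]})

# apply: the function receives one column (a Series) at a time,
# so it can compute a per-column summary.
col_range = nums.apply(lambda col: col.max() - col.min())

# map on a Series: the function receives one value at a time.
doubled_a = nums['a'].map(lambda x: x * 2)

print(col_range)
print(doubled_a)
```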

In [None]:
df['Population'].apply(lambda x:x/3+1) # Apply function to each value; returns a new Series

In [None]:
df['Population'] # Unchanged: apply does not modify the column in place

### Create a new column

In [None]:
df['new_column'] = df['Population'].apply(lambda x:x/3+1) # Store the result in a new column


In [None]:
df

### <center style='color:blue'> Create a DataFrame from a file </center>

### csv file 

In [None]:
df = pd.read_csv("data\\Fire_Department_Calls_for_Service_03_23_2019.csv",
                sep = ',')
df.head()
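
The CSV file above is not included here, so the following sketch reads a small in-memory CSV instead, using `io.StringIO` as a stand-in for a file on disk. The content of `csv_text` is made up for the example; `sep` and `dtype` are the same `read_csv` parameters that would be used with a real file.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = (
    "Country,Capital,Population\n"
    "Morocco,Rabat,30000000\n"
    "China,Beijing,1000000000\n"
)

demo = pd.read_csv(
    io.StringIO(csv_text),
    sep=',',                        # field separator (comma is the default)
    dtype={'Population': 'int64'},  # force the column type
)
print(demo)
print(demo.dtypes)
```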


### excel file

In [None]:
df = pd.read_excel("data\\Fire_Department_Calls_for_Service_03_23_2019.xlsx",
                  sheet_name='Sheet1',
                   #index_col=None, 
                   #header=None,
                   dtype={'Name': str, 'Value': float})
df.head(2)

### <center style='color:blue'> DataFrame to NumPy ndarray </center>

In [None]:
a = df.values # df.to_numpy() is the equivalent recommended in recent pandas versions

In [None]:
type(a)

In [None]:
a
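
The dtype of the resulting array depends on the columns involved. A minimal sketch, using a hypothetical two-column DataFrame and the `to_numpy` method: mixing text and numbers gives an `object` array, while selecting only numeric columns keeps a numeric dtype.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Country': ['Morocco', 'China'],
    'Population': [30000000, 1000000000],
})

# Text and integers together: NumPy falls back to the generic object dtype.
mixed = df.to_numpy()

# Numeric columns only: the array keeps an integer dtype.
numeric = df[['Population']].to_numpy()

print(mixed.dtype, mixed.shape)
print(numeric.dtype, numeric.shape)
```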

Resources:
* https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PandasPythonForDataScience.pdf
* https://pandas.pydata.org/pandas-docs/version/0.22.0/10min.html
* https://realpython.com/python-data-cleaning-numpy-pandas/