# Search and Filter DataFrames in PySpark HW

Now it's time to put what you've learned into action with a homework assignment!

In case you need it again, here is the link to the documentation listing all the functions available in the pyspark.sql.functions library:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions


### First set up your Spark Session!
Alright, first things first: let's start up our PySpark instance.

In [1]:
# First let's create our PySpark instance
# import findspark
# findspark.init()

import pyspark  # only run after findspark.init()
from pyspark.sql import SparkSession

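# May take a while locally
spark = SparkSession.builder.appName("Select").getOrCreate()

# Note: this counts the entries in the executor memory status map (one in local mode),
# which is not necessarily the number of CPU cores
cores = spark._jsc.sc().getExecutorMemoryStatus().keySet().size()
print("You are working with", cores, "core(s)")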
spark

22/10/08 21:52:37 WARN Utils: Your hostname, masoud-ubuntu resolves to a loopback address: 127.0.1.1; using 192.168.7.139 instead (on interface wlp2s0)
22/10/08 21:52:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/10/08 21:52:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
You are working with 1 core(s)


## Read in the DataFrame for this Notebook

We will continue to use the fifa19.csv file for this notebook. Make sure that you are using the correct path to the file.

In [2]:
fifa_df = spark.read.csv("Datasets/fifa19.csv", inferSchema=True, header=True)


## About this dataframe

The **fifa19.csv** dataset includes a list of all the FIFA 2019 players and their attributes listed below: 

 - **General:** Age, Nationality, Overall, Potential, Club
 - **Metrics:** Value, Wage
 - **Player Descriptive:** Preferred Foot, International Reputation, Weak Foot, Skill Moves, Work Rate, Position, Jersey Number, Joined, Loaned From, Contract Valid Until, Height, Weight
 - **Position:** LS, ST, RS, LW, LF, CF, RF, RW, LAM, CAM, RAM, LM, LCM, CM, RCM, RM, LWB, LDM, CDM, RDM, RWB, LB, LCB, CB, RCB, RB
 - **Other:** Crossing, Finishing, HeadingAccuracy, ShortPassing, Volleys, Dribbling, Curve, FKAccuracy, LongPassing, BallControl, Acceleration, SprintSpeed, Agility, Reactions, Balance, ShotPower, Jumping, Stamina, Strength, LongShots, Aggression, Interceptions, Positioning, Vision, Penalties, Composure, Marking, StandingTackle, SlidingTackle, GKDiving, GKHandling, GKKicking, GKPositioning, GKReflexes, and Release Clause.

**Source:** https://www.kaggle.com/karangadiya/fifa19

Use the .toPandas() method to view the first few lines of the dataset so we know what we are working with. 

In [4]:
fifa_df.limit(3).toPandas()

22/10/08 21:54:53 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: , ID, Name, Age, Photo, Nationality, Flag, Overall, Potential, Club, Club Logo, Value, Wage, Special, Preferred Foot, International Reputation, Weak Foot, Skill Moves, Work Rate, Body Type, Real Face, Position, Jersey Number, Joined, Loaned From, Contract Valid Until, Height, Weight, LS, ST, RS, LW, LF, CF, RF, RW, LAM, CAM, RAM, LM, LCM, CM, RCM, RM, LWB, LDM, CDM, RDM, RWB, LB, LCB, CB, RCB, RB, Crossing, Finishing, HeadingAccuracy, ShortPassing, Volleys, Dribbling, Curve, FKAccuracy, LongPassing, BallControl, Acceleration, SprintSpeed, Agility, Reactions, Balance, ShotPower, Jumping, Stamina, Strength, LongShots, Aggression, Interceptions, Positioning, Vision, Penalties, Composure, Marking, StandingTackle, SlidingTackle, GKDiving, GKHandling, GKKicking, GKPositioning, GKReflexes, Release Clause
 Schema: _c0, ID, Name, Age, Photo, Nationality, Flag, Overall, Potential, Club, Club Logo, Value,

Unnamed: 0,_c0,ID,Name,Age,Photo,Nationality,Flag,Overall,Potential,Club,...,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes,Release Clause
0,0,158023,L. Messi,31,https://cdn.sofifa.org/players/4/19/158023.png,Argentina,https://cdn.sofifa.org/flags/52.png,94,94,FC Barcelona,...,96,33,28,26,6,11,15,14,8,€226.5M
1,1,20801,Cristiano Ronaldo,33,https://cdn.sofifa.org/players/4/19/20801.png,Portugal,https://cdn.sofifa.org/flags/38.png,94,94,Juventus,...,95,28,31,23,7,11,15,14,11,€127.1M
2,2,190871,Neymar Jr,26,https://cdn.sofifa.org/players/4/19/190871.png,Brazil,https://cdn.sofifa.org/flags/54.png,92,93,Paris Saint-Germain,...,94,27,24,33,9,9,15,15,11,€228.1M


Now print the schema of the dataset so we can see the data types of all the variables.

In [5]:
fifa_df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- ID: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Photo: string (nullable = true)
 |-- Nationality: string (nullable = true)
 |-- Flag: string (nullable = true)
 |-- Overall: integer (nullable = true)
 |-- Potential: integer (nullable = true)
 |-- Club: string (nullable = true)
 |-- Club Logo: string (nullable = true)
 |-- Value: string (nullable = true)
 |-- Wage: string (nullable = true)
 |-- Special: integer (nullable = true)
 |-- Preferred Foot: string (nullable = true)
 |-- International Reputation: integer (nullable = true)
 |-- Weak Foot: integer (nullable = true)
 |-- Skill Moves: integer (nullable = true)
 |-- Work Rate: string (nullable = true)
 |-- Body Type: string (nullable = true)
 |-- Real Face: string (nullable = true)
 |-- Position: string (nullable = true)
 |-- Jersey Number: integer (nullable = true)
 |-- Joined: string (nullable = true)
 |-- Loaned From: string (nu

## Now let's get started!

### First things first: import the pyspark sql functions library

Since we know we will be using it a lot.

In [7]:
import pyspark.sql.functions as F

### 1. Select the Name and Position of each player in the dataframe

In [9]:
fifa_df.select(["Name", "Position"]).show()

+-----------------+--------+
|             Name|Position|
+-----------------+--------+
|         L. Messi|      RF|
|Cristiano Ronaldo|      ST|
|        Neymar Jr|      LW|
|           De Gea|      GK|
|     K. De Bruyne|     RCM|
|        E. Hazard|      LF|
|        L. Modrić|     RCM|
|        L. Suárez|      RS|
|     Sergio Ramos|     RCB|
|         J. Oblak|      GK|
|   R. Lewandowski|      ST|
|         T. Kroos|     LCM|
|         D. Godín|      CB|
|      David Silva|     LCM|
|         N. Kanté|     LDM|
|        P. Dybala|      LF|
|          H. Kane|      ST|
|     A. Griezmann|     CAM|
|    M. ter Stegen|      GK|
|      T. Courtois|      GK|
+-----------------+--------+
only showing top 20 rows



### 1.1 Display the same results from above sorted by the players' names

In [15]:
fifa_df.select(["Name", "Position"]).orderBy("Name").show()

+--------------+--------+
|          Name|Position|
+--------------+--------+
|      A. Abang|      ST|
| A. Abdellaoui|      LB|
|  A. Abdennour|      CB|
|       A. Abdi|      CM|
| A. Abdu Jaber|      ST|
|A. Abdulhameed|      GK|
|  A. Abedzadeh|      GK|
|      A. Abeid|      LB|
|      A. Ablet|     LWB|
|    A. Abrashi|     CDM|
|   A. Abruscia|    null|
|    A. Absalem|      LB|
|    A. Accardi|      CB|
|    A. Acevedo|     RCB|
|     A. Acosta|      LB|
|     A. Acosta|      RM|
|     A. Acquah|     LCM|
|       A. Adam|      ST|
|      A. Addai|      ST|
|      A. Ademi|     RDM|
+--------------+--------+
only showing top 20 rows



### 2. Select only the players who belong to a club beginning with FC

In [17]:
fifa_df.select(["Name", "Club"]).where(fifa_df["Club"].startswith("FC")).show()

+---------------+-----------------+
|           Name|             Club|
+---------------+-----------------+
|       L. Messi|     FC Barcelona|
|      L. Suárez|     FC Barcelona|
| R. Lewandowski|FC Bayern München|
|  M. ter Stegen|     FC Barcelona|
|Sergio Busquets|     FC Barcelona|
|       M. Neuer|FC Bayern München|
|   J. Rodríguez|FC Bayern München|
|       Coutinho|     FC Barcelona|
|     M. Hummels|FC Bayern München|
|      S. Umtiti|     FC Barcelona|
|     Jordi Alba|     FC Barcelona|
|     I. Rakitić|     FC Barcelona|
|          Piqué|     FC Barcelona|
|      T. Müller|FC Bayern München|
|         Thiago|FC Bayern München|
|     J. Kimmich|FC Bayern München|
|       D. Alaba|FC Bayern München|
|     Y. Brahimi|         FC Porto|
|     J. Boateng|FC Bayern München|
|       A. Vidal|     FC Barcelona|
+---------------+-----------------+
only showing top 20 rows



### 3. Who is the oldest player in the dataset and how old are they?

Display only the name and age of the oldest player.

In [20]:
fifa_df.select(["Name", "Age"]).orderBy(F.desc("Age")).limit(1).show()

+--------+---+
|    Name|Age|
+--------+---+
|O. Pérez| 45|
+--------+---+



### 4. Select only the following players from the dataframe:

 - L. Messi
 - Cristiano Ronaldo

In [21]:
fifa_df.select(["Name", "Club"]).where(
    fifa_df["Name"].isin(["L. Messi", "Cristiano Ronaldo"])
).show()

+-----------------+------------+
|             Name|        Club|
+-----------------+------------+
|         L. Messi|FC Barcelona|
|Cristiano Ronaldo|    Juventus|
+-----------------+------------+



### 5. Can you select the first character from the Release Clause variable which indicates the currency used?

In [23]:
# Note: substr positions are 1-based; a start of 0 behaves like 1 here
fifa_df.select(fifa_df["Release Clause"].substr(0, 1)).show()

+-------------------------------+
|substring(Release Clause, 0, 1)|
+-------------------------------+
|                              €|
|                              €|
|                              €|
|                              €|
|                              €|
|                              €|
|                              €|
|                              €|
|                              €|
|                              €|
|                              €|
|                              €|
|                              €|
|                              €|
|                              €|
|                              €|
|                              €|
|                              €|
|                              €|
|                              €|
+-------------------------------+
only showing top 20 rows



### 6. Can you select only the players who are over the age of 40?

In [26]:
fifa_df.select(["Name", "Age"]).filter("Age>40").toPandas()

Unnamed: 0,Name,Age
0,J. Villar,41
1,B. Nivet,41
2,O. Pérez,45
3,C. Muñoz,41
4,S. Narazaki,42
5,H. Sulaimani,41
6,M. Tyler,41
7,T. Warner,44
8,K. Pilkington,44


### That's it for now... Great Job!