Names.csv
* Add a column with the notebook execution time in epoch format
* Add a column that computes height in feet
* Answer the question: what is the most popular first name?
* Add a column that computes the actors' age
* Drop the columns (bio, death_details)
* Rename the columns: capitalize them and remove _
* Sort the dataframe by name in ascending order

In [0]:
import pyspark.sql.functions as F
from datetime import datetime

In [0]:
filePath = "dbfs:/FileStore/tables/Files/names.csv"
namesDf = spark.read.format("csv") \
              .option("header","true") \
              .option("inferSchema","true") \
              .load(filePath)

In [0]:
# Most popular first name: the first space-separated token of `name`
most_popular_name = namesDf \
  .withColumn('first_name', F.split(F.col("name"), " ").getItem(0)) \
  .groupBy("first_name") \
  .count() \
  .orderBy(F.col("count").desc()) \
  .first()[0]


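transformedNamesDf = (
    namesDf
    # Notebook execution time as epoch seconds
    .withColumn("epoch_time", F.unix_timestamp(F.current_timestamp()))
    # Height in feet, assuming `height` is in centimetres; 0.032 is a rounded
    # factor (1 cm = 1/30.48 ft, about 0.0328 ft)
    .withColumn("height_feet", F.col("height") * 0.032)
    # The date columns mix formats (01.04.1973, 1895-02-01, "1956 in ..."),
    # so take the first four-digit run as the year
    .withColumn("birth_year", F.regexp_extract(F.col("date_of_birth"), r"(\d{4})", 1))
    .withColumn("death_year", F.regexp_extract(F.col("date_of_death"), r"(\d{4})", 1))
    # Age at death; when there is no death year, age as of the current year
    .withColumn(
        "age",
        F.when(
            F.col("death_year").isNull(),
            F.year(F.current_date()) - F.col("birth_year").cast("int"),
        ).otherwise(F.col("death_year") - F.col("birth_year")),
    )
    .drop("bio", "death_details")
    .orderBy(F.col("name").asc())
)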

# Title-case the column names and turn underscores into spaces (date_of_birth -> Date Of Birth)
new_columns = [col.replace("_", " ").title() for col in transformedNamesDf.columns]
transformedNamesDf = transformedNamesDf.toDF(*new_columns)

display(transformedNamesDf.take(10))

Imdb Name Id,Name,Birth Name,Height,Birth Details,Date Of Birth,Place Of Birth,Date Of Death,Place Of Death,Reason Of Death,Spouses String,Spouses,Divorces,Spouses With Children,Children,Epoch Time,Height Feet,Birth Year,Death Year,Age
nm1001478,'Big' LeRoy Mobley,LeRoy King Mobley III,193.0,"April 1, 1973 in Atlantic City, New Jersey, USA",01.04.1973,"Atlantic City, New Jersey, USA",,,,,0,0,0,0,1743685147,6.176,1973,,52.0
nm0521811,'Ducky' Louie,Lawrence Louie,,"July 22, 1931 in Berkeley, California, USA",22.07.1931,"Berkeley, California, USA",,,,,0,0,0,0,1743685147,,1931,,94.0
nm0722372,'Little Billy' Rhodes,William H. Rhodes,,"February 1, 1895 in Illinois, USA",1895-02-01,"Illinois, USA",24.07.1967,"Hollywood, California, USA",stroke,,0,0,0,0,1743685147,,1895,1967.0,72.0
nm0946148,'Weird Al' Yankovic,Alfred Matthew Yankovic,183.0,"October 23, 1959 in Downey, California, USA",23.10.1959,"Downey, California, USA",,,,Suzanne Krajewski (10 February 2001 - present) (1 child),1,0,1,1,1743685147,5.856,1959,,66.0
nm1265067,50 Cent,Curtis James Jackson III,183.0,"July 6, 1975 in Queens, New York City, New York, USA",06.07.1975,"Queens, New York City, New York, USA",,,,,0,0,0,0,1743685147,5.856,1975,,50.0
nm0553436,A Martinez,Adolph Larrue Martinez III,175.0,"September 27, 1948 in Glendale, California, USA",27.09.1948,"Glendale, California, USA",,,,Leslie Bryans (17 July 1982 - present) (3 children)Mare Winningham (1981 - 29 January 1982) (divorced),2,1,1,3,1743685147,5.6000000000000005,1948,,77.0
nm1100197,A. Baldwin Sloane,A. Baldwin Sloane,,"August 28, 1872 in Baltimore, Maryland, USA",1872-08-28,"Baltimore, Maryland, USA",21.02.1925,"Red Bank, New Jersey, USA",,,0,0,0,0,1743685147,,1872,1925.0,53.0
nm0080406,A. Bhimsingh,A. Bhimsingh,,"July 15, 1924 in Tirupati, Andhra Pradesh, India",15.07.1924,"Tirupati, Andhra Pradesh, India",16.01.1978,"Madras, Tamil Nadu, India",,Sukumari (? - 16 January 1978) (his death) (1 child),1,0,1,1,1743685147,,1924,1978.0,54.0
nm0770661,A. Hans Scheirl,Angela Hans Schierl,,"1956 in Salzburg, Austria","1956 in Salzburg, Austria","Salzburg, Austria",,,,,0,0,0,0,1743685147,,1956,,69.0
nm0072200,A. Jonathan Benny,A. Jonathan Benny,,"November 4, 1970",04.11.1970,,,,,,0,0,0,0,1743685147,,1970,,55.0
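
The `Height Feet` values above come from multiplying by 0.032, a rounded factor. As a minimal sketch in plain Python (not Spark), and assuming `height` is given in centimetres (the notebook does not state the unit), the conversion can be checked against the exact definition 1 ft = 30.48 cm. The helper name `cm_to_feet` is hypothetical.

```python
CM_PER_FOOT = 30.48  # exact by definition of the international foot


def cm_to_feet(height_cm):
    """Convert centimetres to feet; a missing height (None) stays None."""
    if height_cm is None:
        return None
    return height_cm / CM_PER_FOOT


# Heights taken from the sample output above, plus a missing value
for h in (193.0, 183.0, 175.0, None):
    exact = cm_to_feet(h)
    rounded = None if h is None else h * 0.032  # factor used in the notebook cell
    print(h, exact, rounded)
```

For 193 cm this gives about 6.33 ft, against 6.176 ft with the rounded factor.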

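The most-popular-name query treats the first space-separated token of `name` as the first name. A small plain-Python sketch of the same idea, using only the ten names visible in the sample output above, shows what that token looks like for nicknames and initials. The function name `first_token_counts` is hypothetical.

```python
from collections import Counter


def first_token_counts(names):
    """Count the first space-separated token of each non-empty name."""
    return Counter(name.split(" ")[0] for name in names if name)


# Names copied from the ten rows shown in the output above
sample_names = [
    "'Big' LeRoy Mobley",
    "'Ducky' Louie",
    "'Little Billy' Rhodes",
    "'Weird Al' Yankovic",
    "50 Cent",
    "A Martinez",
    "A. Baldwin Sloane",
    "A. Bhimsingh",
    "A. Hans Scheirl",
    "A. Jonathan Benny",
]

counts = first_token_counts(sample_names)
top_token, top_count = counts.most_common(1)[0]
print(top_token, top_count)
```

On these ten rows the top token is the initial `A.`, and quoted nicknames such as `'Little Billy'` are split in the middle, so the result on the full file may need a cleaning step before it reads as a real first name.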

Movies.csv
* Add a column with the notebook execution time in epoch format
* Add a column that computes how many years have passed since the film was published
* Add a column that shows the film's budget as a numeric value (the currency symbols have to be removed)
* Drop the rows of the dataframe that contain null values

In [0]:
filePath = "dbfs:/FileStore/tables/Files/movies.csv"
moviesDf = spark.read.format("csv") \
              .option("header","true") \
              .option("inferSchema","true") \
              .load(filePath)



In [0]:
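transformedMoviesDf = (
    moviesDf
    # Notebook execution time as epoch seconds
    .withColumn("epoch_time", F.unix_timestamp(F.current_timestamp()))
    # Years since publication; assumes date_published holds a bare year such as 1975
    .withColumn("years_since_release", F.year(F.current_date()) - F.col("date_published"))
    # Strip currency symbols and commas, then cast the remainder to a number
    .withColumn("budget_numeric", F.regexp_replace(F.col("budget"), r"[\$,€£]", "").cast("double"))
    # Drop every row that has a null in any column
    .dropna()
)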

display(transformedMoviesDf.take(10))

imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,writer,production_company,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics,epoch_time,years_since_release,budget_numeric
tt0071615,La montagna sacra,La montaña sagrada,1973,1975,"Adventure, Drama, Fantasy",114,Mexico,"Spanish, English",Alejandro Jodorowsky,Alejandro Jodorowsky,ABKCO Films,"Alejandro Jodorowsky, Horacio Salinas, Zamira Saunders, Juan Ferrara, Adriana Page, Burt Kleiner, Valerie Jodorowsky, Nicky Nichols, Richard Rutowski, Luis Lomelí, Ana De Sade, Chucho-Chucho, Letícia Robles, Connie De La Mora, David Kapralik","In a corrupt, greed-fueled world, a powerful alchemist leads a Christ-like character and seven materialistic figures to the Holy Mountain, where they hope to achieve enlightenment.",07.wrz,35412,$ 750000,$ 61001,$ 104160,76.0,160.0,100.0,1743685446,50.0,750000.0
tt0075265,È nata una stella,A Star Is Born,1976,1977,"Drama, Music, Romance",139,USA,English,Frank Pierson,"John Gregory Dunne, Joan Didion",Barwood Films,"Barbra Streisand, Kris Kristofferson, Gary Busey, Oliver Clark, Venetta Fields, Clydie King, Marta Heflin, M.G. Kelly, Sally Kirkland, Joanne Linville, Uncle Rudy, Paul Mazursky, Stephen Bruton, Sammy Lee Creason, Cleve Dupin","A has-been rock star falls in love with a young, up-and-coming songstress.",06.lut,9699,$ 6000000,$ 80000000,$ 80000000,59.0,98.0,47.0,1743685446,48.0,6000000.0
tt0077714,1964: Allarme a N.Y. arrivano i Beatles!,I Wanna Hold Your Hand,1978,1983,"Comedy, Music, Romance",104,USA,English,Robert Zemeckis,"Robert Zemeckis, Bob Gale",Amblin Entertainment,"Nancy Allen, Bobby Di Cicco, Marc McClure, Susan Kendall Newman, Theresa Saldana, Wendie Jo Sperber, Eddie Deezen, Christian Juttner, Will Jordan, Read Morgan, Claude Earl Jones, James Houghton, Michael Hewitson, Dick Miller, Vito Carenzo","In 1964, six teenagers from New Jersey run off to see",06.wrz,4282,$ 2700000,$ 1944682,$ 1944682,64.0,43.0,41.0,1743685446,42.0,2700000.0
tt0087344,Godzilla 1985,Godzilla 1985,1985,1985,"Action, Horror, Sci-Fi",82,Japan,"Japanese, Russian, English","Koji Hashimoto, R.J. Kizer","Reuben Bercovitch, Fred Dekker",Toho Company,"Raymond Burr, Ken Tanaka, Yasuko Sawaguchi, Yôsuke Natsuki, Shin Takuma, Keiju Kobayashi, Eitarô Ozawa, Taketoshi Naitô, Mizuho Suzuki, Junkichi Orimoto, Hiroshi Koizumi, Kei Satô, Takenori Emoto, Sho Hashimoto, Nobuo Kaneko","Thirty years after the original monster's rampage, a new Godzilla emerges and attacks Japan.",06.mar,5874,$ 2000000,$ 4116395,$ 4116395,31.0,69.0,61.0,1743685446,40.0,2000000.0
tt0097523,"Tesoro, mi si sono ristretti i ragazzi","Honey, I Shrunk the Kids",1989,1989,"Adventure, Comedy, Family",93,"USA, Mexico",English,Joe Johnston,"Stuart Gordon, Brian Yuzna",Walt Disney Pictures,"Rick Moranis, Matt Frewer, Marcia Strassman, Kristine Sutherland, Thomas Wilson Brown, Jared Rushton, Amy O'Neill, Robert Oliveri, Carl Steven, Mark L. Taylor, Kimmy Robertson, Lou Cutell, Laura Waterbury, Trevor Galtress, Martin Aylett",The scientist father of a teenage girl and boy accidentally shrinks his and two other neighborhood teens to the size of insects. Now the teens must fight diminutive dangers as the father searches for them.,06.kwi,139632,$ 18000000,$ 130724172,$ 222724172,63.0,95.0,42.0,1743685446,36.0,18000000.0
tt0103247,Zanna Bianca - Un piccolo grande lupo,White Fang,1991,1991,"Adventure, Drama",107,USA,English,Randal Kleiser,"Jack London, Jeanne Rosenberg",Walt Disney Pictures,"Jed, Klaus Maria Brandauer, Ethan Hawke, Seymour Cassel, Susan Hogan, James Remar, Bill Moseley, Clint Youngreen, Pius Savage, Aaron Hotch, Charles Jimmie Sr., Clifford Fossman, Irvin Sogge, Tom Fallon, Dick Mackey",Jack London's classic adventure story about the friendship developed between a Yukon gold hunter and the mixed dog-wolf he rescues from the hands of a man who mistreats him.,06.lip,19198,$ 14000000,$ 34793160,$ 34793160,62.0,37.0,13.0,1743685446,34.0,14000000.0
tt0120176,Il prigioniero,The Spanish Prisoner,1997,1999,"Drama, Mystery, Thriller",110,USA,English,David Mamet,David Mamet,Jasmine Productions Inc.,"Campbell Scott, Ricky Jay, Rebecca Pidgeon, Richard L. Friedman, Ben Gazzara, Jerry Graff, G. Roy Levin, Hilary Hinckle, David Pittu, Steve Martin, Christopher Kaldor, Felicity Huffman, Gary McDonald, Mike Robinson, Olivia Tecosky",An employee of a corporation with a lucrative secret process is tempted to betray it. But there's more to it than that.,07.sty,21543,$ 10000000,$ 9593903,$ 9593903,70.0,271.0,104.0,1743685446,26.0,10000000.0
tt0366444,Fighting Tommy Riley,Fighting Tommy Riley,2004,2006,"Drama, Mystery, Romance",109,USA,"English, Spanish",Eddie O'Flaherty,J.P. Davis,Visualeyes Productions,"Eddie Jones, J.P. Davis, Christina Chambers, Diane Tayler, Paul Raci, Don Wallace, Scot Belsky, Emanuel Zacarias, Carlos Palomino, Michael Bentt, Winston Bailey, Pepper Roach, Eric Brown, Charles 'Chillie' Wilson, Frank McGonagle","An aging trainer and a young fighter, both in need of a second chance, team-up to overcome the demons of their past...and chase the dreams of their future.",06.maj,772,$ 300000,$ 10514,$ 10514,53.0,23.0,12.0,1743685446,19.0,300000.0
tt0439544,Dirty,Dirty,2005,2005,"Crime, Drama, Thriller",97,USA,"English, Spanish",Chris Fisher,"Chris Fisher, Gil Reavill",2710 Inc.,"Frank Alvarez, Clifton Collins Jr., Brittany Daniel, Keith David, Roberto 'Lil Rob' Flores, Aimee Garcia, Cesar Garcia, Nicholas Gonzalez, Cuba Gooding Jr., Kevin Grevioux, Wood Harris, Cole Hauser, Wyclef Jean, Pat Healy, Tory Kittles",Two gangbangers-turned-cops try and cover up a scandal within the LAPD.,05.cze,5225,$ 3000000,$ 274245,$ 274245,37.0,45.0,20.0,1743685446,20.0,3000000.0
tt1194263,Get Low,Get Low,2009,2014,"Drama, Mystery",103,"USA, Germany, Poland",English,Aaron Schneider,"Chris Provenzano, C. Gaby Mitchell",K5 International,"Robert Duvall, Sissy Spacek, Bill Murray, Lucas Black, Gerald McRaney, Bill Cobbs, Scott Cooper, Lori Beth Sikes, Linds Edwards, Andrea Powell, Chandler Riggs, Danny Vinson, Blerim Destani, Tomasz Karolak, Andy Stahl","A movie spun out of equal parts folk tale, fable and real-life legend about the mysterious, 1930s Tennessee hermit who famously threw his own rollicking funeral party... while he was still alive.",7.0,21904,$ 7000000,$ 9176933,$ 10522511,77.0,108.0,163.0,1743685446,11.0,7000000.0
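
The `budget_numeric` step strips `$`, `€`, `£` and commas before casting. A plain-Python sketch of the same rule is below. `parse_budget` is a hypothetical helper, and the `$ 18,000,000` and `ITL 3000000` strings are made-up examples of other ways a budget might be written. Returning `None` for an unparseable string is meant to mirror a failed cast yielding null, which is how a non-ANSI Spark cast is expected to behave.

```python
import re

# Same character class as the regexp_replace call in the notebook cell
CURRENCY_CHARS = re.compile(r"[\$,€£]")


def parse_budget(raw):
    """Return the budget as a float, or None when it cannot be parsed."""
    if raw is None:
        return None
    cleaned = CURRENCY_CHARS.sub("", raw).strip()
    try:
        return float(cleaned)
    except ValueError:
        return None  # e.g. a leftover currency code such as "ITL"


for raw in ("$ 750000", "$ 18,000,000", "ITL 3000000", None):
    print(repr(raw), "->", parse_budget(raw))
```

Rows whose budget ends up as `None` here correspond to rows that the final `dropna()` would remove.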

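The `years_since_release` step subtracts `date_published` directly, which fits the bare years in the sample output. If the column also held full dates, one option (an assumption, not what the notebook does) is to pull out the four-digit year first, as the Names.csv cell does for birth dates. The helper `years_since` and the sample date strings are hypothetical.

```python
import re
from datetime import date

YEAR = re.compile(r"\d{4}")  # first run of four digits is taken as the year


def years_since(date_published, current_year):
    """Years between the first four-digit year in the value and current_year."""
    if date_published is None:
        return None
    match = YEAR.search(str(date_published))
    if match is None:
        return None
    return current_year - int(match.group())


this_year = date.today().year
for value in ("1975", "1975-05-15", "15.05.1975", None):
    print(repr(value), "->", years_since(value, this_year))
```

Passing the current year as a parameter keeps the helper deterministic, which makes it easier to check than a call that reads the clock internally.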

ratings.csv
* Add a column with the notebook execution time in epoch format
* Ignore `nulls` in each of the calculations below
* Who gives better ratings across the whole dataset: males or females?
* Change the data type of one of the columns to `long`

In [0]:
filePath = "dbfs:/FileStore/tables/Files/ratings.csv"
ratingsDf = spark.read.format("csv") \
              .option("header","true") \
              .option("inferSchema","true") \
              .load(filePath)



In [0]:
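transformedRatingsDf = (
    ratingsDf
    # Notebook execution time as epoch seconds
    .withColumn("epoch_time", F.unix_timestamp(F.current_timestamp()))
    # Ignore rows that contain nulls
    .dropna()
    # Change one column's type to long
    .withColumn("total_votes", F.col("total_votes").cast("long"))
    # Per title: which group gave the higher average vote
    .withColumn(
        "higher_rating",
        F.when(F.col("females_allages_avg_vote") > F.col("males_allages_avg_vote"), "Female")
        .when(F.col("males_allages_avg_vote") > F.col("females_allages_avg_vote"), "Male")
        .otherwise("Equal"),
    )
    # Count titles per outcome; only higher_rating and count remain after this step
    .groupBy("higher_rating")
    .count()
)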
  
display(transformedRatingsDf)


higher_rating,count
Equal,1948
Female,10959
Male,4081
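
The counts above compare the two groups title by title (Female higher on 10959 titles, Male on 4081). Another reading of "for the whole set" is to compare the overall mean of each column while skipping nulls. A plain-Python sketch under that assumption follows. The four rows are made-up values, and `mean_ignoring_nulls` is a hypothetical helper.

```python
from statistics import mean

# Made-up sample rows; None stands for a null in the CSV
rows = [
    {"females_allages_avg_vote": 6.2, "males_allages_avg_vote": 5.9},
    {"females_allages_avg_vote": None, "males_allages_avg_vote": 7.1},
    {"females_allages_avg_vote": 5.0, "males_allages_avg_vote": 5.0},
    {"females_allages_avg_vote": 7.4, "males_allages_avg_vote": 6.4},
]


def mean_ignoring_nulls(rows, column):
    """Mean of a column over the rows where it is not None."""
    values = [row[column] for row in rows if row[column] is not None]
    return mean(values) if values else None


female_mean = mean_ignoring_nulls(rows, "females_allages_avg_vote")
male_mean = mean_ignoring_nulls(rows, "males_allages_avg_vote")
print("Female" if female_mean > male_mean else "Male or equal")
```

In PySpark, `F.avg` skips nulls in the same way, so a single `agg` over both columns would give the equivalent numbers without dropping whole rows first.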
