A near-real-time distributed stream processing framework; latency is measured in seconds.
Getting started with Spark:
http://www.cnblogs.com/shishanyuan/p/4699644.html
https://www.cnblogs.com/shishanyuan/p/4747735.html
Official English docs:
http://spark.apache.org/docs/latest/streaming-programming-guide.html
SSH "The authenticity of host xxx.xxx.xxx.xxx can't be established." problem; set the permissions of authorized_keys to 600, not 400:
http://blog.csdn.net/lina791211/article/details/11818825
Fix for the Hadoop error "namenode running as process 18472. Stop it first.":
http://www.linuxidc.com/Linux/2014-07/104315.htm
Linux: how to change the permissions of files inside a directory:
https://zhidao.baidu.com/question/139006065.html
Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
http://blog.csdn.net/qyl445/article/details/50684691
How to increase Spark worker memory and DataNode memory:
http://blog.csdn.net/wonder4/article/details/52476008
scala.MatchError exceptions and setting a StructType schema (a schema sketch follows the link):
http://blog.csdn.net/gangchengzhong/article/details/70153932
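A minimal sketch of declaring a StructType schema (the field names here are illustrative); a scala.MatchError typically means a Row value does not match its declared field type:
import org.apache.spark.sql.types._
// field types must match the actual values inside each Row
val schema = StructType(Seq(
  StructField("deviceId", StringType, nullable = true),
  StructField("catchTime", LongType, nullable = true)
))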
Inserting dates into Oracle from Spark Streaming:
http://www.aboutyun.com/thread-18282-1-1.html
Spark operators of all kinds (a tiny sketch follows the links):
https://www.cnblogs.com/zlslch/p/5723857.html
http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html (a blog on the Spark RDD API with exhaustive examples for each operator)
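A tiny sketch of the two operator families, assuming an existing SparkContext named sc:
// transformations (filter, map) are lazy and only build the lineage;
// actions (reduce, collect, foreach) trigger an actual job
val nums = sc.parallelize(1 to 10)
val even = nums.filter(_ % 2 == 0).map(_ * 10) // lazy, nothing runs yet
val total = even.reduce(_ + _)                 // action, runs the job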
Inserting data per partition with foreachPartition: use a connection pool written in Java; the Scala-written one is problematic (a per-partition sketch follows these links):
The Java-written pool works, but that post's Spark Streaming program has issues; see the third link:
http://blog.csdn.net/erfucun/article/details/52312682
The Scala-written connection pool has problems, but the Spark Streaming program is fine:
http://blog.csdn.net/legotime/article/details/51836039
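A minimal sketch of the per-partition pattern, assuming a hypothetical ConnectionPool object standing in for the Java-written pool from the links above; one connection is taken per partition rather than per record:
import java.sql.Connection
import org.apache.spark.rdd.RDD
def savePartitioned(rdd: RDD[String]): Unit =
  rdd.foreachPartition { iter =>
    val conn: Connection = ConnectionPool.getConnection() // hypothetical pool API
    iter.foreach { record =>
      // insert `record` through `conn` here
    }
    ConnectionPool.returnConnection(conn) // hand the connection back instead of closing it
  }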
Fix for Oracle "ORA-01000: maximum open cursors exceeded":
https://www.cnblogs.com/qinjunli/p/4588089.html
Batch inserts into Oracle (sketched below):
http://blog.csdn.net/yuanzexi/article/details/50912630
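A minimal JDBC batch-insert sketch (URL, credentials, and table are placeholders); addBatch queues rows locally and executeBatch ships them in one round trip:
import java.sql.DriverManager
val conn = DriverManager.getConnection("jdbc:oracle:thin:@//host:1521/orcl", "user", "pwd")
val stmt = conn.prepareStatement("INSERT INTO t(id, name) VALUES (?, ?)")
try {
  Seq((1, "a"), (2, "b")).foreach { case (id, name) =>
    stmt.setInt(1, id)
    stmt.setString(2, name)
    stmt.addBatch() // queue the row locally
  }
  stmt.executeBatch() // one round trip for all queued rows
} finally {
  stmt.close() // closing statements also avoids ORA-01000 cursor leaks
  conn.close()
}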
Using the c3p0 connection pool with Spark:
http://www.jianshu.com/p/65c1b319b70a
The difference between object and class in Scala (illustrated below):
http://blog.csdn.net/wangxiaotongfan/article/details/48242029
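A tiny illustration of the distinction: a class is instantiated with new, while an object is a ready-made singleton, often used as a companion holding "static" members:
class Counter { private var n = 0; def inc(): Int = { n += 1; n } }
object Counter { def create(): Counter = new Counter } // companion object, single instance
val c = Counter.create() // call through the singleton
c.inc()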
Scala basic syntax:
http://www.yiibai.com/scala/scala_overview.html
Reading settings from a configuration file in Spark:
http://blog.csdn.net/u012307002/article/details/53308937
Spark Streaming + Kafka with zero data loss:
https://www.cnblogs.com/jacksu-tencent/p/5135869.html
https://www.jianshu.com/p/8603ba4be007
http://blog.csdn.net/lsshlsw/article/details/51133217
https://www.iteblog.com/archives/1591.html
Zero-data-loss implementation for Spark Streaming consuming Kafka via the Direct approach:
https://www.cnblogs.com/hd-zg/p/6841249.html
Spark Streaming reading Kafka data: offsets can be recorded via checkpoints, a database or file, or written back to ZooKeeper:
https://www.cnblogs.com/xlturing/p/6246538.html
Direct approach with offsets saved in ZooKeeper (an alternative that commits offsets back to Kafka is sketched after these links):
https://github.com/xlturing/spark-journey/blob/master/SparkStreamingKafka/src/main/scala/com/sparkstreaming/main/KafkaManager.scala
http://blog.csdn.net/lw_ghy/article/details/50926855
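Besides ZooKeeper (the approach of the KafkaManager linked above), the kafka-0-10 integration can also commit offsets back to Kafka itself once a batch has been processed; a minimal sketch:
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}
def processAndCommit(stream: InputDStream[ConsumerRecord[String, String]]): Unit =
  stream.foreachRDD { rdd =>
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges // offsets of this batch
    // ... process the batch here; commit only if processing succeeded ...
    stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
  }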
Spark Streaming + Kafka Integration Guide (Kafka version 0.10.0 or higher):
http://blog.csdn.net/zhongguozhichuang/article/details/53282858
Spark 2.x study notes: 1. Spark 2.2 quick start (local mode):
http://blog.csdn.net/chengyuqiang/article/details/77671748?locationNum=4&fps=1
Spark 2.11: the two streaming APIs + Kafka:
http://blog.csdn.net/zeroder/article/details/73650731
The difference between foreachPartition and mapPartitions: mapPartitions is a transformation, foreachPartition is an action (a tiny sketch follows the link):
http://blog.csdn.net/u010454030/article/details/78897150
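A tiny sketch of the distinction, assuming an RDD[String] argument; mapPartitions only records lineage lazily, while foreachPartition runs a job right away:
import org.apache.spark.rdd.RDD
def demo(rdd: RDD[String]): Unit = {
  val lengths = rdd.mapPartitions(iter => iter.map(_.length)) // transformation: lazy, returns RDD[Int]
  rdd.foreachPartition(iter => println(iter.size))            // action: runs a job now, returns Unit
}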
Spark heartbeats depend on the cluster machines' clocks: if the nodes' times are inconsistent, heartbeats get out of balance and errors follow.
Kafka + Spark Streaming + Redis real-time computation in practice:
http://dataunion.org/17837.html
An example of integrating Kafka into Spark Streaming:
https://cloud.tencent.com/developer/article/1017077
Parsing JSON in Spark Streaming:
https://blog.csdn.net/weixin_35040169/article/details/80057561
https://github.com/json4s/json4s
References:
https://cloud.tencent.com/developer/article/1106470
https://hacpai.com/article/1499918046636
Use cases: real-time recommendation, user behavior analysis, etc.
When the program runs with only 2 threads, the Spark UI gets stuck at the Executor row:
Cause: the threads block because too few are allocated; allocate at least 3 or 4 (I use 4), e.g.:
val sparkSession = SparkSession
  .builder()
  .master("local[4]") // at least 3-4 local threads; with fewer, no thread may be left for processing
  .appName(appName)
  .getOrCreate()
Error: "Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases. val lines = insiDE.selectExpr("CAST(value AS STRING)").as[String]"
https://stackoverflow.com/questions/38664972/why-is-unable-to-find-encoder-for-type-stored-in-a-dataset-when-creating-a-dat
Fix: add "import sparkSession.implicits._" (matching the name of your own SparkSession variable) rather than "import spark.implicits._" (see the sketch below).
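A minimal sketch of the fix: the encoders come from the implicits of your own session variable, so the import prefix must match that variable's name (the names here are illustrative):
import org.apache.spark.sql.SparkSession
val sparkSession = SparkSession.builder().master("local[*]").appName("enc").getOrCreate()
import sparkSession.implicits._ // provides the encoders for .toDF/.toDS and .as[String]
val lines = Seq("a", "b").toDF("value").selectExpr("CAST(value AS STRING)").as[String]
lines.show()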
https://blog.csdn.net/xiaolin93/article/details/78526888?locationNum=9&fps=1
https://www.cnblogs.com/dt-zhw/p/5551438.html
https://tech.meituan.com/spark_tuning_pro.html
https://www.cnblogs.com/arachis/p/spark_parameters.html
http://slamke.github.io/2017/05/12/Spark%E5%A6%82%E4%BD%95%E5%A4%84%E7%90%86%E5%BC%82%E5%B8%B8/
https://segmentfault.com/a/1190000007302496
http://blog.51cto.com/9269309/1954540
Note: any external class referenced from Spark's main program must be serializable:
http://www.aboutyun.com/thread-19464-1-1.html
http://xuexi.edu360.cn/question/1005.html
https://blog.csdn.net/chen20111/article/details/82798207
https://www.jianshu.com/p/0fe51185eeba
https://www.iteblog.com/archives/1353.html
Manual fix (make the class serializable, as sketched below):
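A minimal sketch of that requirement (class name and method are illustrative): any helper class used inside a Spark closure must be serializable, or the task fails with NotSerializableException:
class DbHelper(val table: String) extends Serializable {
  def insertSql(cols: Int): String =
    s"INSERT INTO $table VALUES (${Seq.fill(cols)("?").mkString(",")})"
}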
Spark error "No Route to Host from localhost.localdomain/127.0.0.1 to master:9000 failed on socket timeout exception":
http://www.zhangyitian.cn/blog/%E8%A7%A3%E5%86%B3spark-shell%E5%90%AF%E5%8A%A8%E6%8A%A5%E9%94%99%EF%BC%9Ano-route-to-host-from-localhost-localdomain127-0-0-1-to-master9000-failed-on-socket-timeout-exception/
If Hadoop configuration files are present anywhere on the machine, Spark will pick them up; deleting the stale Hadoop config files resolves it.
How Spark Streaming relates to Structured Streaming:
http://www.sohu.com/a/111860912_116235
Writing to a database from Structured Streaming's foreach sink:
https://www.cnblogs.com/huliangwen/p/7470705.html (only raises the idea, no details)
https://www.jianshu.com/p/0a65a248c6a8 (writes the data to MySQL)
https://toutiao.io/posts/8vqfdo/preview (writes to memory, MySQL, or Kafka; requires implementing a custom ForeachWriter, as sketched below)
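A minimal custom-sink sketch along the lines of the posts above, assuming the Record case class defined in the JSON section below and placeholder JDBC settings; open/process/close run once per partition per trigger:
import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.spark.sql.ForeachWriter
class JdbcRecordWriter(url: String, user: String, password: String)
    extends ForeachWriter[Record] {
  private var conn: Connection = _
  private var stmt: PreparedStatement = _
  override def open(partitionId: Long, version: Long): Boolean = {
    conn = DriverManager.getConnection(url, user, password)
    stmt = conn.prepareStatement("INSERT INTO record(device_id, imsi) VALUES (?, ?)")
    true // return false to skip writing this partition
  }
  override def process(r: Record): Unit = {
    stmt.setString(1, r.deviceId)
    stmt.setString(2, r.imsi)
    stmt.executeUpdate()
  }
  override def close(errorOrNull: Throwable): Unit = {
    if (stmt != null) stmt.close()
    if (conn != null) conn.close()
  }
}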
Error "Caused by: java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse(Lorg/json4s/JsonInput;Z)Lorg/json4s/JsonAST$JValue;"
Cause: the following dependencies were declared in Maven's pom.xml:
<dependency>
  <groupId>org.json4s</groupId>
  <artifactId>json4s-native_2.11</artifactId>
  <version>3.6.0</version>
</dependency>
<dependency>
  <groupId>org.json4s</groupId>
  <artifactId>json4s-jackson_2.11</artifactId>
  <version>3.6.0</version>
</dependency>
These must not be declared again: the Spark dependency already pulls in the json4s jars transitively, and the extra copy causes the version conflict.
References for Structured Streaming reading JSON data from Kafka:
These cover only a single JSON object, for example:
{"deviceId":"ab123","imsi":"123456789","catchTime":1537450341,"regional":"广东省深圳市","createTime":1537450640}
Solution:
https://blog.csdn.net/weixin_35040169/article/details/80057561
https://github.com/json4s/json4s/
If what arrives is a JSON array instead, for example:
[{"deviceId":"ab123","imsi":"123456789","catchTime":1537450341,"regional":"广东省深圳市","createTime":1537450640},{"deviceId":"ab123","imsi":"123456789","catchTime":1537450341,"regional":"广东省深圳市","createTime":1537450640}]
Solution: extract into a List, as follows:
import org.apache.spark.sql.streaming.Trigger
import org.json4s._
import org.json4s.jackson.JsonMethods.parse
// target type for one element of the JSON array (fields as in the sample above)
case class Record(deviceId: String, imsi: String, catchTime: Long, regional: String, createTime: Long)
val query1 = df
  .selectExpr("CAST(value AS STRING)") // cast the Kafka value bytes to a string column
  .as[String] // needs import sparkSession.implicits._
  .map(value => {
    println(value)
    // implicit formats: use json4s' default converters
    implicit val formats: DefaultFormats.type = DefaultFormats
    val json = parse(value)
    json.extract[List[Record]] // the whole array becomes one List[Record]
  })
  .writeStream
  .outputMode(outputMode)
  .foreach(new RecordWriter())
  .trigger(Trigger.ProcessingTime(processingTime)) // e.g. one micro-batch every 5 seconds
  .start()
Kafka consumer parameters that cannot be set on the Structured Streaming source:
group.id: the Kafka source automatically creates a unique group for each query
auto.offset.reset: use the startingOffsets option instead; Structured Streaming maintains offsets itself for any stream, to keep the promised exactly-once guarantee
key.deserializer: expressed as casts on the DataFrame; keys are always read as byte arrays (ByteArrayDeserializer)
value.deserializer: expressed as casts on the DataFrame; values are always read as byte arrays (ByteArrayDeserializer)
enable.auto.commit: the Kafka source does not commit any offsets itself
interceptor.classes: consumer interceptors are not safe here, since the source always reads keys and values as byte arrays
A sketch of these replacements follows the link below.
https://www.cnblogs.com/huzuoliang/p/7070728.html
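A minimal source sketch, assuming the sparkSession from earlier (bootstrap servers and topic are placeholders); startingOffsets stands in for auto.offset.reset, and deserialization becomes a cast on the DataFrame:
val kafkaDf = sparkSession.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "topic1")
  .option("startingOffsets", "earliest") // instead of auto.offset.reset
  .load()
val parsed = kafkaDf.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") // instead of key/value.deserializer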
[References]:
Kafka offset out of range: https://www.jianshu.com/p/157fd4a73b3a